🤖 AI Summary
This paper addresses the lack of systematic reliability assessment methods for large language models (LLMs) in binary text classification. Methodologically, it introduces a psychometrically grounded framework for evaluating LLM classification reliability, adapting classical measurement principles to quantify response consistency, invalid-response rates, and intra- and inter-rater reliability. The framework combines multi-round repeated inference (five runs per model), sample-size planning guidelines, and an invalid-response detection metric. Experiments span 14 state-of-the-art LLMs, including gpt-4o, claude-3-7-sonnet, and llama3.2, on 1,350 financial news articles. Results show within-model perfect-agreement rates of 90-98% and accuracy of 0.76-0.88 against StockNewsAPI labels; smaller models (e.g., gemma3:1B, llama3.2:3B, claude-3-5-haiku) outperform larger counterparts, yet no model exceeds chance when predicting actual market movements, pointing to task constraints rather than model limitations.
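The intra-rater consistency reported above (perfect agreement across five replicate runs) can be sketched as below; the function and sample data are illustrative assumptions, not the authors' implementation.

```python
def perfect_agreement_rate(runs):
    """Fraction of items labeled identically across all replicate runs.

    `runs` is a list of label sequences, one per inference run
    (e.g., five runs of binary sentiment labels over the same articles).
    """
    n_items = len(runs[0])
    assert all(len(r) == n_items for r in runs), "runs must align item-wise"
    agree = sum(1 for labels in zip(*runs) if len(set(labels)) == 1)
    return agree / n_items

# Five hypothetical replicate runs over 5 articles (1 = positive, 0 = negative):
runs = [
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0],  # the fourth article flips label in this run
    [1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0],
]
print(perfect_agreement_rate(runs))  # → 0.8 (4 of 5 items unanimous)
```

A 90-98% result on this metric would mean 9 to 9.8 of every 10 articles receive the same label in all five runs.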
📝 Abstract
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification, addressing the lack of established reliability assessment methods. Adapting psychometric principles, we determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability. Our case study examines financial news sentiment classification across 14 LLMs (including claude-3-7-sonnet, gpt-4o, deepseek-r1, gemma3, llama3.2, phi4, and command-r-plus), with five replicates per model on 1,350 articles. Models demonstrated high intra-rater consistency, achieving perfect agreement on 90-98% of examples, with minimal differences between expensive and economical models from the same families. When validated against StockNewsAPI labels, models achieved strong performance (accuracy 0.76-0.88), with smaller models like gemma3:1B, llama3.2:3B, and claude-3-5-haiku outperforming larger counterparts. All models performed at chance when predicting actual market movements, indicating task constraints rather than model limitations. Our framework provides systematic guidance for LLM selection, sample size planning, and reliability assessment, enabling organizations to optimize resources for classification tasks.
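The abstract mentions determining sample size requirements from psychometric principles. One standard approach (an assumption here, not necessarily the paper's exact method) sizes the sample to estimate an agreement proportion within a chosen margin of error:

```python
import math

def sample_size_for_proportion(p=0.5, margin=0.05, z=1.96):
    """Minimum n to estimate a proportion p within +/-margin at the
    confidence level implied by z (1.96 for ~95%).

    Standard formula: n = z^2 * p * (1 - p) / margin^2, rounded up.
    p = 0.5 is the conservative worst case when agreement is unknown.
    """
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

print(sample_size_for_proportion())                     # → 385 (worst case, ±5%)
print(sample_size_for_proportion(p=0.9, margin=0.02))   # → 865 (high agreement, ±2%)
```

Under this rule of thumb, the study's 1,350 articles would comfortably cover estimates at tighter margins than the conservative defaults.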