🤖 AI Summary
This paper addresses the lack of systematic reliability assessment methods for large language models (LLMs) in binary text classification. Methodologically, it introduces a psychometrically grounded framework for evaluating LLM classification reliability, adapting classical measurement principles to quantify response consistency, invalid-response rates, and intra- and inter-rater reliability. The framework combines multi-round repeated inference (five runs per model), sample-size planning guidelines, and an invalid-response detection metric. Experiments span 14 state-of-the-art LLMs, including gpt-4o, claude-3-7-sonnet, and llama3.2, on 1,350 financial news articles. Results show within-model perfect-agreement rates of 90-98% and accuracy of 0.76-0.88 against StockNewsAPI labels; smaller models (e.g., gemma3:1B, llama3.2:3B, claude-3-5-haiku) outperform larger counterparts, yet no model exceeds chance when predicting actual market movements, pointing to task constraints rather than model limitations.
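The intra-rater consistency reported above (perfect agreement across five replicate runs) can be sketched as below; the function and sample data are illustrative assumptions, not the authors' implementation.

```python
def perfect_agreement_rate(runs):
    """Fraction of items labeled identically across all replicate runs.

    `runs` is a list of label sequences, one per inference run
    (e.g., five runs of binary sentiment labels over the same articles).
    """
    n_items = len(runs[0])
    assert all(len(r) == n_items for r in runs), "runs must align item-wise"
    agree = sum(1 for labels in zip(*runs) if len(set(labels)) == 1)
    return agree / n_items

# Five hypothetical replicate runs over 5 articles (1 = positive, 0 = negative):
runs = [
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0],  # the fourth article flips label in this run
    [1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0],
]
print(perfect_agreement_rate(runs))  # → 0.8 (4 of 5 items unanimous)
```

A 90-98% result on this metric would mean 9 to 9.8 of every 10 articles receive the same label in all five runs.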
📝 Abstract
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification, addressing the lack of established reliability assessment methods. Adapting psychometric principles, we determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability. Our case study examines financial news sentiment classification across 14 LLMs (including claude-3-7-sonnet, gpt-4o, deepseek-r1, gemma3, llama3.2, phi4, and command-r-plus), with five replicates per model on 1,350 articles. Models demonstrated high intra-rater consistency, achieving perfect agreement on 90-98% of examples, with minimal differences between expensive and economical models from the same families. When validated against StockNewsAPI labels, models achieved strong performance (accuracy 0.76-0.88), with smaller models like gemma3:1B, llama3.2:3B, and claude-3-5-haiku outperforming larger counterparts. All models performed at chance when predicting actual market movements, indicating task constraints rather than model limitations. Our framework provides systematic guidance for LLM selection, sample size planning, and reliability assessment, enabling organizations to optimize resources for classification tasks.
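The abstract mentions determining sample size requirements from psychometric principles. One standard approach (an assumption here, not necessarily the paper's exact method) sizes the sample to estimate an agreement proportion within a chosen margin of error:

```python
import math

def sample_size_for_proportion(p=0.5, margin=0.05, z=1.96):
    """Minimum n to estimate a proportion p within +/-margin at the
    confidence level implied by z (1.96 for ~95%).

    Standard formula: n = z^2 * p * (1 - p) / margin^2, rounded up.
    p = 0.5 is the conservative worst case when agreement is unknown.
    """
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

print(sample_size_for_proportion())                     # → 385 (worst case, ±5%)
print(sample_size_for_proportion(p=0.9, margin=0.02))   # → 865 (high agreement, ±2%)
```

Under this rule of thumb, the study's 1,350 articles would comfortably cover estimates at tighter margins than the conservative defaults.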