🤖 AI Summary
This work addresses the challenge of uncertainty quantification for large language models (LLMs) in high-stakes multiple-choice question answering. It systematically evaluates token-level entropy and model self-judgment (MASJ) as error predictors across knowledge-intensive (e.g., biology) and reasoning-intensive (e.g., mathematics) tasks. Results show entropy achieves strong predictive performance in knowledge domains (ROC AUC = 0.73) but fails in reasoning-heavy ones (AUC = 0.55), revealing an implicit dependence on reasoning load; MASJ performs near chance overall. Further analysis uncovers reasoning-load bias across subdomains in the MMLU-Pro benchmark, compromising evaluation fairness. The paper thus advocates integrating data uncertainty modeling and proposes a restructured MMLU-Pro with balanced reasoning difficulty across topics. This provides both methodological insights for LLM uncertainty assessment and a concrete pathway for benchmark refinement.
📝 Abstract
Uncertainty estimation is crucial for evaluating Large Language Models (LLMs), particularly in high-stakes domains where incorrect answers result in significant consequences. Numerous approaches address this problem, but each focuses on a specific type of uncertainty and ignores the others. We investigate which estimates, specifically token-wise entropy and model-as-judge (MASJ), work for multiple-choice question-answering tasks across different question topics. Our experiments consider three LLMs (Phi-4, Mistral, and Qwen) at sizes from 1.5B to 72B parameters and $14$ topics. While MASJ performs similarly to a random error predictor, the response entropy predicts model error in knowledge-dependent domains and serves as an effective indicator of question difficulty: for biology, ROC AUC is $0.73$. This correlation vanishes in the reasoning-dependent domain: for math questions, ROC AUC is $0.55$. More fundamentally, we found that the entropy measure implicitly depends on the amount of reasoning a question requires. Thus, entropy related to data uncertainty should be integrated within uncertainty-estimation frameworks, while MASJ requires refinement. Moreover, existing MMLU-Pro samples are biased: the required amount of reasoning should be balanced across subdomains to provide a fairer assessment of LLM performance.
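The evaluation protocol described above, scoring each response by token-level entropy and checking how well that score predicts model error via ROC AUC, can be sketched as follows. This is a minimal illustration, not the paper's implementation; function names and the toy inputs are hypothetical, and a real setup would obtain per-token probabilities from the LLM's logits.

```python
import math

def mean_token_entropy(token_dists):
    """Mean Shannon entropy (nats) over per-token next-token distributions.

    token_dists: one probability distribution per generated token,
    each a list of probabilities summing to 1.
    """
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in token_dists
    ]
    return sum(entropies) / len(entropies)

def roc_auc(scores, errors):
    """ROC AUC of `scores` as a predictor of binary `errors` (1 = model wrong).

    Computed as the probability that an erroneous answer receives a higher
    score than a correct one, counting ties as 0.5 (the Mann-Whitney view).
    """
    pos = [s for s, e in zip(scores, errors) if e == 1]
    neg = [s for s, e in zip(scores, errors) if e == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy check: a uniform distribution over 4 tokens has entropy log(4).
uniform_entropy = mean_token_entropy([[0.25, 0.25, 0.25, 0.25]])

# Toy check: if every wrong answer outscores every correct one, AUC is 1.0.
perfect_auc = roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

Under this framing, the paper's finding reads as: entropy scores separate wrong from correct answers well for biology (AUC $0.73$) but barely better than chance (AUC $0.5$) for math.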