When an LLM is apprehensive about its answers -- and when its uncertainty is justified

📅 2025-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of uncertainty quantification for large language models (LLMs) in high-stakes multiple-choice question answering. It systematically evaluates token-level entropy and model self-judgment (MASJ) as error predictors across knowledge-intensive (e.g., biology) and reasoning-intensive (e.g., mathematics) tasks. Results show entropy achieves strong predictive performance in knowledge domains (ROC AUC = 0.73) but fails in reasoning-heavy ones (AUC = 0.55), revealing an implicit dependence on reasoning load; MASJ performs near chance overall. Further analysis uncovers reasoning-load bias across subdomains in the MMLU-Pro benchmark, compromising evaluation fairness. The paper thus advocates integrating data uncertainty modeling and proposes a restructured MMLU-Pro with balanced reasoning difficulty across topics. This provides both methodological insights for LLM uncertainty assessment and a concrete pathway for benchmark refinement.

📝 Abstract
Uncertainty estimation is crucial for evaluating Large Language Models (LLMs), particularly in high-stakes domains where incorrect answers have significant consequences. Numerous approaches address this problem, but each focuses on a specific type of uncertainty while ignoring others. We investigate which estimates, specifically token-wise entropy and model-as-judge (MASJ), work for multiple-choice question-answering tasks across different question topics. Our experiments consider three LLM families: Phi-4, Mistral, and Qwen, at sizes from 1.5B to 72B parameters, and $14$ topics. While MASJ performs similarly to a random error predictor, response entropy predicts model error in knowledge-dependent domains and serves as an effective indicator of question difficulty: for biology the ROC AUC is $0.73$. This correlation vanishes in the reasoning-dependent domain: for math questions the ROC AUC is $0.55$. More fundamentally, we found that the entropy measure implicitly depends on the amount of reasoning a question requires. Thus, data-uncertainty-related entropy should be integrated into uncertainty-estimation frameworks, while MASJ requires refinement. Moreover, existing MMLU-Pro samples are biased: the required amount of reasoning should be balanced across subdomains to provide a fairer assessment of LLM performance.
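The entropy-as-error-predictor setup described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's code: the per-question answer distributions and error labels below are invented, and ROC AUC is computed via the rank (Mann-Whitney) formulation rather than any particular library.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def roc_auc(scores, errors):
    """ROC AUC of `scores` as a predictor of binary `errors` (1 = model wrong).

    Mann-Whitney formulation: fraction of (error, correct) pairs where the
    error case received the higher score; ties count half.
    """
    pos = [s for s, e in zip(scores, errors) if e == 1]
    neg = [s for s, e in zip(scores, errors) if e == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical distributions over 4 answer options (A-D) for 4 questions.
dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident -> low entropy
    [0.70, 0.20, 0.05, 0.05],
    [0.40, 0.30, 0.20, 0.10],  # diffuse -> high entropy
    [0.30, 0.30, 0.25, 0.15],
]
errors = [0, 0, 1, 1]  # 1 = the model answered this question incorrectly

entropies = [token_entropy(d) for d in dists]
print(roc_auc(entropies, errors))  # → 1.0: entropy perfectly ranks errors here
```

On knowledge-dependent questions this ranking works well (the paper's biology AUC of 0.73); on reasoning-heavy math questions the same score degrades toward chance (0.55), which is the paper's core observation.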
Problem

Research questions and friction points this paper is trying to address.

Evaluating uncertainty in LLMs for high-stakes domains
Assessing token-wise entropy and MASJ for multiple-choice tasks
Addressing bias in MMLU-Pro samples for fair LLM assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-wise entropy effectively predicts model error in knowledge-dependent domains.
Model-as-judge (MASJ) performs near chance and needs refinement.
Data-uncertainty-related entropy should be integrated into uncertainty-estimation frameworks.