🤖 AI Summary
This work addresses the lack of reliably validated uncertainty quantification (UQ) methods for large language models in scientific question answering, which hinders trustworthy use of generated answers. We introduce the first large-scale UQ evaluation benchmark tailored for reasoning-intensive scientific QA, encompassing 20 models, 7 datasets, and 685,000 long-form answers, along with an open-source, extensible calibration evaluation framework. By combining prompt engineering with token- and sequence-level UQ approaches—including probabilistic confidence, verbalized uncertainty, and answer consistency—we systematically assess the effectiveness of various UQ metrics. Our analysis reveals that answer frequency (cross-sample consistency) yields the most reliably calibrated sequence-level uncertainty estimates, while verbalized uncertainty exhibits systematic bias. We further demonstrate that instruction tuning and reasoning-focused fine-tuning often induce overconfident predictions, and caution that relying solely on Expected Calibration Error (ECE) can be misleading.
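The answer-frequency signal highlighted above can be sketched in a few lines: sample N responses to the same question, extract each final answer, and use the frequency of the modal answer as the confidence score. This is an illustrative sketch, not the paper's implementation; the `answer_frequency_confidence` helper is hypothetical, and answer extraction from long-form responses is assumed to happen upstream.

```python
from collections import Counter

def answer_frequency_confidence(sampled_answers):
    """Cross-sample consistency as a sequence-level confidence estimate.

    sampled_answers: final answers extracted from N independently sampled
    responses to the same question (e.g. ["A", "A", "B", "A"]).
    Returns the modal answer and its relative frequency in [0, 1].
    """
    counts = Counter(sampled_answers)
    answer, count = counts.most_common(1)[0]
    return answer, count / len(sampled_answers)

# Example: 3 of 4 samples agree on "A", so confidence is 0.75.
answer, confidence = answer_frequency_confidence(["A", "A", "B", "A"])
```

A design note: this treats the model's output distribution, rather than its token probabilities or verbalized self-reports, as the uncertainty signal, which is why it sidesteps the polarization and bias issues the study reports for the other two families of methods.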
📝 Abstract
Large Language Models (LLMs) are commonly used in Question Answering (QA) settings, increasingly so in the natural sciences and in science at large. Reliable Uncertainty Quantification (UQ) is critical for the trustworthy uptake of generated answers, yet existing UQ approaches remain weakly validated in scientific QA, a domain that demands both fact retrieval and reasoning. We introduce the first large-scale benchmark for evaluating the calibration of UQ metrics in reasoning-demanding QA, together with an extensible open-source framework for reproducibly assessing calibration. Our study spans up to 20 LLMs across base, instruction-tuned, and reasoning variants. Our analysis covers seven scientific QA datasets, including both multiple-choice and arithmetic question answering tasks, using prompting to emulate an open question answering setting. We evaluate and compare methods representative of the prominent UQ approaches on a total of 685,000 long-form responses, spanning reasoning complexities representative of domain-specific tasks. At the token level, we find that instruction tuning induces strong probability-mass polarization, reducing the reliability of token-level confidences as uncertainty estimates. Models further fine-tuned for reasoning exhibit the same effect, although the reasoning process appears to mitigate it to a degree that varies by provider. At the sequence level, we show that verbalized approaches are systematically biased and poorly correlated with correctness, while answer frequency (consistency across samples) yields the most reliable calibration. Building on this analysis, we study and report the misleading effect of relying on Expected Calibration Error (ECE) as the sole measure of UQ performance on benchmark datasets. Our findings expose critical limitations of current UQ methods for LLMs and of standard practices for benchmarking them.
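To make the ECE caution concrete, here is a minimal sketch of the standard binned ECE: confidences are grouped into equal-width bins, and the metric is the bin-weighted mean absolute gap between accuracy and average confidence. This is an illustrative implementation under common conventions (10 equal-width bins), not the paper's evaluation code.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|.

    confidences: predicted confidence scores in [0, 1].
    correct: 1 if the corresponding answer was correct, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each prediction to one of n_bins equal-width bins over [0, 1].
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece
```

The pitfall the abstract points to can be seen directly: a degenerate UQ method that always reports the base accuracy rate (e.g. a constant 0.5 when half the answers are correct) achieves an ECE of exactly zero while carrying no information about which individual answers are right, so a low ECE alone does not certify a useful uncertainty estimate.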