🤖 AI Summary
This work identifies systematic deficiencies in how large language models (LLMs) represent probabilistic beliefs: their outputs frequently violate fundamental laws of probability, including the law of total probability and Bayesian updating, yielding logically inconsistent and miscalibrated confidence estimates that undermine their reliability for trustworthy decision-making and interpretable reasoning. To address this, the authors introduce the first benchmark dataset of statements with ground-truth uncertainty, together with a multidimensional evaluation framework that assesses statement-level uncertainty annotation, confidence calibration, probabilistic logical consistency, and deviation from Bayesian updating. They conduct zero-shot and prompt-engineering evaluations across leading closed- and open-source LLMs. The results show that current LLMs deviate substantially from rational probabilistic norms, with calibration errors well above those of conventional statistical models, highlighting the need for dedicated probabilistic cognitive modeling approaches.
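For concreteness, the sketch below shows one way such violations could be quantified from probabilities elicited from a model. This is a minimal illustration, not the authors' released code; the function names and the example numbers are assumptions made here for demonstration, not values from the paper.

```python
# Minimal sketch (not the paper's released code): given probabilities elicited
# from an LLM for an event A, a conditioning event B, and their complements,
# quantify how far the model strays from basic laws of probability.

def total_probability_gap(p_a, p_a_given_b, p_a_given_not_b, p_b):
    """Law of total probability: P(A) = P(A|B)P(B) + P(A|~B)(1 - P(B))."""
    implied_p_a = p_a_given_b * p_b + p_a_given_not_b * (1.0 - p_b)
    return abs(p_a - implied_p_a)

def complement_gap(p_a, p_not_a):
    """Additivity: P(A) + P(~A) = 1."""
    return abs(p_a + p_not_a - 1.0)

def bayes_update_gap(p_a_given_b, p_b_given_a, p_a, p_b):
    """Bayes' rule: P(A|B) = P(B|A)P(A) / P(B)."""
    return abs(p_a_given_b - p_b_given_a * p_a / p_b)

# Hypothetical elicited values (placeholders, not data from the paper):
print(total_probability_gap(p_a=0.70, p_a_given_b=0.90, p_a_given_not_b=0.40, p_b=0.50))  # ~0.05
print(complement_gap(p_a=0.70, p_not_a=0.45))                                             # ~0.15
print(bayes_update_gap(p_a_given_b=0.80, p_b_given_a=0.60, p_a=0.50, p_b=0.40))           # ~0.05
```

A score of zero on each gap would indicate internally coherent probabilities; larger values indicate the kind of axiom violations the paper reports.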
📝 Abstract
Advances in the general capabilities of large language models (LLMs) have led to their use for information retrieval and as components in automated decision systems. A faithful representation of probabilistic reasoning in these models may be essential for trustworthy, explainable, and effective performance in these tasks. Despite previous work suggesting that LLMs can perform complex reasoning and well-calibrated uncertainty quantification, we find that current versions of this class of model lack the ability to provide rational and coherent representations of probabilistic beliefs. To demonstrate this, we introduce a novel dataset of claims with indeterminate truth values and apply a number of well-established techniques for uncertainty quantification to measure the ability of LLMs to adhere to fundamental properties of probabilistic reasoning.
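As an example of one such well-established technique, the sketch below computes expected calibration error (ECE) over model confidences. The binning scheme, variable names, and toy data are assumptions made here for illustration and are not taken from the paper's experimental setup.

```python
# Minimal sketch of one standard uncertainty-quantification diagnostic,
# expected calibration error (ECE), applied to confidences elicited from an
# LLM; the binning choice and toy data below are illustrative assumptions.
import numpy as np

def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Average |empirical accuracy - mean confidence| over equal-width
    confidence bins, weighted by the fraction of samples in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(outcomes[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Toy usage with made-up confidences and correctness labels:
print(expected_calibration_error([0.9, 0.8, 0.6, 0.3], [1, 0, 1, 0]))
```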