🤖 AI Summary
This study examines how well large language models' (LLMs) inference-time uncertainty aligns with human uncertainty perception, with the goal of improving model controllability and user trust. We propose "uncertainty alignment" as a paradigm that goes beyond conventional calibration, which focuses solely on predictive accuracy, and instead evaluates how well diverse uncertainty quantification methods (both classical measures and novel variants) correlate with crowd-sourced human uncertainty judgments. Using multidimensional metrics, including correctness correlation and distributional similarity, we systematically assess alignment on human-annotated datasets. Results show that certain uncertainty measures robustly capture human uncertainty perception even though they do not reflect human answer preferences; moreover, several measures achieve moderate-to-strong traditional calibration performance *while* aligning well with human judgments. This work broadens the conceptual scope of uncertainty modeling in LLMs and provides interpretable, actionable uncertainty signals for trustworthy AI.
📝 Abstract
There has been much recent interest in evaluating large language models (LLMs) for uncertainty calibration to facilitate model control and modulate user trust. Inference-time uncertainty, which can provide a real-time signal to the model or to external control modules, is particularly important for applying these concepts to improve the LLM-user experience in practice. While many existing papers consider model calibration, comparatively little work has evaluated how closely model uncertainty aligns with human uncertainty. In this work, we evaluate a collection of inference-time uncertainty measures, using both established metrics and novel variations, to determine how closely they align with both human group-level uncertainty and traditional notions of model calibration. We find that numerous measures show evidence of strong alignment with human uncertainty, despite the lack of alignment with human answer preference. For those successful measures, we find moderate to strong evidence of model calibration in terms of both correctness correlation and distributional analysis.
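The paper's specific measures and datasets are not reproduced here, but the core idea of group-level alignment can be sketched in a few lines: estimate a model's inference-time uncertainty as the entropy of its sampled answers per question, do the same for crowd-sourced human votes, and correlate the two across questions. This is a minimal illustrative sketch under stated assumptions; all data, function names, and the choice of entropy and Pearson correlation are hypothetical stand-ins, not the paper's actual method.

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: per-question model samples and human votes.
model_samples = [["A"] * 9 + ["B"], ["A"] * 5 + ["B"] * 5, ["C"] * 8 + ["A"] * 2]
human_votes = [["A"] * 8 + ["B"] * 2, ["A"] * 6 + ["B"] * 4, ["C"] * 7 + ["A"] * 3]

# Per-question uncertainty for model and humans, then cross-question correlation.
model_u = [entropy(s) for s in model_samples]
human_u = [entropy(v) for v in human_votes]
alignment = pearson(model_u, human_u)
```

A distributional-similarity analysis would instead compare the model's and humans' answer distributions per question (e.g., with a divergence measure), and a calibration analysis would correlate the model's uncertainty with its answer correctness; the per-question entropy vector above is the common starting point for all three.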