🤖 AI Summary
This work investigates confidence calibration of large language models (LLMs) on multiple-choice tasks, revealing that overconfidence arises from the interplay of model scale, distractor design, and question type. Methodologically, we propose the first actionable taxonomy of calibration failure modes—moving beyond correlation-based analysis to precisely characterize conditions that exacerbate overconfidence—and introduce a standardized multiple-choice benchmark featuring confidence–accuracy alignment analysis, quantitative distractor sensitivity measurement, and cross-scale comparative evaluation. Key results show: (1) distractor sensitivity increases with model scale; (2) while GPT-4o achieves superior overall calibration, its overconfidence intensifies markedly under adversarial distractors; and (3) smaller models improve accuracy via option-aware reasoning but degrade uncertainty estimation. These findings provide critical empirical grounding for calibration-aware interventions in LLM deployment.
📝 Abstract
Large Language Models (LLMs) demonstrate impressive performance across diverse tasks, yet confidence calibration remains a challenge. Miscalibration, where models are systematically overconfident or underconfident, poses risks, particularly in high-stakes applications. This paper presents an empirical study of LLM calibration, examining how model size, distractors, and question types affect the alignment between confidence and accuracy. We introduce an evaluation framework to measure overconfidence and investigate whether multiple-choice formats mitigate or worsen miscalibration. Our findings show that while larger models (e.g., GPT-4o) are better calibrated overall, they are more susceptible to distractors, whereas smaller models benefit more from the presence of answer choices but struggle with uncertainty estimation. Unlike prior work, which primarily reports miscalibration trends, we provide actionable insights into failure modes and the conditions that worsen overconfidence. These findings highlight the need for calibration-aware interventions and improved uncertainty estimation methods.
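The abstract refers to measuring confidence–accuracy alignment without naming a specific metric. A standard choice for this kind of analysis is Expected Calibration Error (ECE), which bins predictions by stated confidence and averages the gap between confidence and accuracy per bin. The sketch below is illustrative only and is not the paper's own evaluation framework; the function name, bin count, and toy data are assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - confidence| per confidence bin.

    confidences: model confidences in [0, 1]
    correct: 1 if the corresponding prediction was right, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        # Gap between empirical accuracy and mean confidence in this bin,
        # weighted by the fraction of samples falling in the bin.
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += (mask.sum() / len(confidences)) * gap
    return ece

# Toy example of an overconfident model: high stated confidence,
# noticeably lower accuracy, so ECE is large.
conf = [0.95, 0.90, 0.92, 0.88, 0.96]
hit = [1, 0, 1, 0, 1]
print(expected_calibration_error(conf, hit))
```

An overconfidence study like this one would typically report such a metric per condition (with vs. without distractors, per model scale) to quantify how calibration degrades.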