🤖 AI Summary
This work investigates confidence calibration of large language models (LLMs) on multiple-choice tasks, revealing that overconfidence arises from the interplay of model scale, distractor design, and question type. Methodologically, we propose the first actionable taxonomy of calibration failure modes—moving beyond correlation-based analysis to precisely characterize conditions that exacerbate overconfidence—and introduce a standardized multiple-choice benchmark featuring confidence–accuracy alignment analysis, quantitative distractor sensitivity measurement, and cross-scale comparative evaluation. Key results show: (1) distractor sensitivity increases with model scale; (2) while GPT-4o achieves superior overall calibration, its overconfidence intensifies markedly under adversarial distractors; and (3) smaller models improve accuracy via option-aware reasoning but degrade uncertainty estimation. These findings provide critical empirical grounding for calibration-aware interventions in LLM deployment.
📝 Abstract
Large Language Models (LLMs) demonstrate impressive performance across diverse tasks, yet confidence calibration remains a challenge. Miscalibration, where models are systematically overconfident or underconfident, poses risks, particularly in high-stakes applications. This paper presents an empirical study of LLM calibration, examining how model size, distractors, and question types affect the alignment between confidence and accuracy. We introduce an evaluation framework to measure overconfidence and investigate whether multiple-choice formats mitigate or worsen miscalibration. Our findings show that while larger models (e.g., GPT-4o) are better calibrated overall, they are more susceptible to distractors, whereas smaller models benefit more from the presence of answer choices but struggle with uncertainty estimation. Unlike prior work, which primarily reports miscalibration trends, we provide actionable insights into failure modes and the conditions that worsen overconfidence. These findings highlight the need for calibration-aware interventions and improved uncertainty estimation methods.
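The abstract refers to measuring confidence–accuracy alignment without naming a specific metric. A standard choice for this kind of analysis is Expected Calibration Error (ECE), which bins predictions by stated confidence and averages the gap between confidence and accuracy per bin. The sketch below is illustrative only and is not the paper's own evaluation framework; the function name, bin count, and toy data are assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - confidence| per confidence bin.

    confidences: model confidences in [0, 1]
    correct: 1 if the corresponding prediction was right, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        # Gap between empirical accuracy and mean confidence in this bin,
        # weighted by the fraction of samples falling in the bin.
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += (mask.sum() / len(confidences)) * gap
    return ece

# Toy example of an overconfident model: high stated confidence,
# noticeably lower accuracy, so ECE is large.
conf = [0.95, 0.90, 0.92, 0.88, 0.96]
hit = [1, 0, 1, 0, 1]
print(expected_calibration_error(conf, hit))
```

An overconfidence study like this one would typically report such a metric per condition (with vs. without distractors, per model scale) to quantify how calibration degrades.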