Calibrating Overconfidence Without Sacrificing Confidence: Probe-Conditioned Head Intervention for LLMs

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the tendency of large language models to express high confidence in incorrect answers. Existing calibration remedies typically act globally or at the score level, so they reduce unwarranted confidence but risk eroding warranted confidence in correct answers. The authors propose Probe-Conditioned Head Intervention (PCHI), an inference-time method that is applied conditionally: a frozen probe flags likely wrong-but-confident responses, and PCHI then rescales downstream attention-head outputs at the confidence readout token and, in a joint variant, at upstream confidence-template tokens. On Qwen3-4B-Instruct solving OpenMathInstruct problems with a structured binary confidence field, readout-token PCHI converts 82.2% of originally wrong-yes confidence readouts to no. The joint intervention across upstream template tokens reduces Expected Calibration Error (ECE) from 21.9% to 9.2% while damaging only 5.1% of originally correct-yes readouts. The readout-token effect also appears on Gemma3-4B, although upstream interventions there are weaker and more mask-dependent.
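For readers unfamiliar with the metric, the following is a minimal sketch of the standard binned Expected Calibration Error, assuming equal-width confidence bins. The function name, the bin count, and the way a confidence value is obtained from the model are illustrative assumptions; the listing does not state the paper's exact binning or how the yes/no confidence field is turned into a score.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1] (assumed to be the model's stated confidence)
    correct:     booleans, True where the answer was right
    n_bins:      number of equal-width bins (an assumption, not from the paper)
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each prediction to an equal-width bin; confidence 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Four answers all stated at 0.9 confidence, only half of them correct.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
# Confidence matches correctness exactly.
calibrated = expected_calibration_error([1.0, 0.0], [True, False])
```

In the first call the single occupied bin has accuracy 0.5 against mean confidence 0.9, so the gap is about 0.4; in the second call the gap is zero.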

📝 Abstract

Large language models often express high confidence in answers that are wrong. Standard calibration remedies typically act globally or at the score level, reducing unwarranted confidence but also risking erosion of warranted confidence on correct answers. We introduce Probe-Conditioned Head Intervention (PCHI), an inference-time method that uses a frozen probe to detect likely wrong-but-confident responses and conditionally rescales downstream attention-head outputs during confidence generation. On Qwen3-4B-Instruct solving OpenMathInstruct problems with a structured binary confidence field, readout-token PCHI converts 82.2% of originally wrong-yes confidence readouts to $\texttt{no}$, while a joint intervention across upstream confidence-template tokens reduces ECE from 21.9% to 9.2% and damages only 5.1% of originally correct-yes readouts. The readout-token effect also appears on Gemma3-4B, though upstream interventions are weaker and more mask-dependent. These results show that verbalized overconfidence can be selectively reduced through conditionally applied internal intervention, partially decoupling the suppression of unwarranted confidence from the loss of warranted confidence.
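The conditional structure described in the abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the logistic linear probe, the `threshold` and `scale` parameters, the `head_mask`, and both function names are hypothetical, and plain Python lists stand in for model activations.

```python
import math


def probe_score(hidden, weights, bias):
    """Hypothetical frozen linear probe with a logistic output.

    Returns an estimated probability that the response is wrong but confident.
    The weights are fixed (frozen); nothing here is trained at inference time.
    """
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def conditional_head_rescale(head_outputs, head_mask, hidden, weights, bias,
                             threshold=0.5, scale=0.0):
    """Rescale selected attention-head outputs only when the probe fires.

    head_outputs: one output vector per attention head
    head_mask:    True for heads chosen for intervention (assumed given)
    threshold:    probe score above which the intervention applies (assumption)
    scale:        multiplier applied to the selected heads (assumption)
    Returns (new_head_outputs, intervened_flag).
    """
    if probe_score(hidden, weights, bias) < threshold:
        # Probe does not flag the response: leave every head untouched.
        return [list(h) for h in head_outputs], False
    rescaled = [
        [scale * x for x in h] if selected else list(h)
        for h, selected in zip(head_outputs, head_mask)
    ]
    return rescaled, True


heads = [[1.0, 2.0], [3.0, 4.0]]
mask = [True, False]
probe_w, probe_b = [4.0, 0.0], -2.0

# Probe fires (score is about 0.88): only the masked head is rescaled.
flagged_out, flagged = conditional_head_rescale(
    heads, mask, [1.0, 0.0], probe_w, probe_b, scale=0.5)
# Probe stays quiet (score is about 0.12): outputs pass through unchanged.
clean_out, clean = conditional_head_rescale(
    heads, mask, [0.0, 0.0], probe_w, probe_b, scale=0.5)
```

The gate is what separates this from a global remedy: when the probe does not flag a response, the head outputs pass through unmodified, which is how an intervention of this kind can leave most correct, confident readouts intact.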

Problem

Research questions and friction points this paper is trying to address.

overconfidence

calibration

large language models

confidence estimation

ECE

Innovation

Methods, ideas, or system contributions that make the work stand out.

calibration

overconfidence

attention intervention