Confidence Calibration under Ambiguous Ground Truth

πŸ“… 2026-03-24
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses a limitation of conventional confidence calibration methods: they assume each sample has a single ground-truth label, an assumption that breaks down under substantial annotator disagreement and leads to biased calibration. To overcome this, the authors propose annotator-disagreement-aware post-hoc calibration methods that optimize proper scoring rules against the full label distribution, improving calibration without model retraining. The study further characterizes, for the first time, the systematic bias introduced by majority-vote labels, and introduces a multi-level weakly supervised calibration strategy showing that pre-aggregated label distributions are unnecessary. Evaluated on four benchmarks with real or multi-source annotations, the proposed Dirichlet-Soft method reduces expected calibration error (ECE) relative to the true label distribution by 55–87%, while LS-TS achieves reductions of 9–77% even without access to annotator-level information.
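The summary's central claim, that fitting Temperature Scaling on majority-vote labels biases the temperature toward over-sharp predictions, can be illustrated with a small sketch. Everything below is synthetic and hypothetical (random annotator distributions, toy logits, a plain grid search); it is not the paper's data or implementation, only a minimal demonstration of the effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 3

# Hypothetical setup: ambiguous ground truth as annotator label
# distributions, and model logits that roughly track them.
soft_labels = rng.dirichlet(np.full(k, 0.7), size=n)
logits = np.log(soft_labels + 1e-6) + rng.normal(scale=0.5, size=(n, k))
hard_labels = np.eye(k)[soft_labels.argmax(axis=1)]  # majority vote, one-hot

def nll(T, targets):
    """Cross-entropy (a proper scoring rule) of temperature-scaled probs."""
    z = logits / T
    z -= z.max(axis=1, keepdims=True)          # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -np.mean(np.sum(targets * np.log(p + 1e-12), axis=1))

def fit_temperature(targets):
    Ts = np.linspace(0.05, 5.0, 400)           # simple grid search over T
    return Ts[np.argmin([nll(T, targets) for T in Ts])]

T_hard = fit_temperature(hard_labels)   # fitted on majority-vote labels
T_soft = fit_temperature(soft_labels)   # fitted on the full distribution

# Majority-vote targets have zero label entropy, so they pull the fitted
# temperature down (sharper, more confident) than the annotator
# distribution warrants -- the bias the summary describes.
print(T_hard, T_soft)
```

On ambiguous data like this, the majority-vote fit yields a lower temperature than the full-distribution fit, i.e. the calibrator underestimates annotator uncertainty.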

πŸ“ Abstract
Confidence calibration assumes a unique ground-truth label per input, yet this assumption fails wherever annotators genuinely disagree. Post-hoc calibrators fitted on majority-voted labels, the standard single-label targets used in practice, can appear well-calibrated under conventional evaluation yet remain substantially miscalibrated against the underlying annotator distribution. We show that this failure is structural: under simplifying assumptions, Temperature Scaling is biased toward temperatures that underestimate annotator uncertainty, with true-label miscalibration increasing monotonically with annotation entropy. To address this, we develop a family of ambiguity-aware post-hoc calibrators that optimise proper scoring rules against the full label distribution and require no model retraining. Our methods span progressively weaker annotation requirements: Dirichlet-Soft leverages the full annotator distribution and achieves the best overall calibration quality across settings; Monte Carlo Temperature Scaling with a single annotation per example (MCTS S=1) matches full-distribution calibration across all benchmarks, demonstrating that pre-aggregated label distributions are unnecessary; and Label-Smooth Temperature Scaling (LS-TS) operates with voted labels alone by constructing data-driven pseudo-soft targets from the model's own confidence. Experiments on four benchmarks with real multi-annotator distributions (CIFAR-10H, ChaosNLI) and clinically-informed synthetic annotations (ISIC 2019, DermaMNIST) show that Dirichlet-Soft reduces true-label ECE by 55–87% relative to Temperature Scaling, while LS-TS reduces ECE by 9–77% without any annotator data.
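The abstract reports "true-label ECE", i.e. calibration error measured against the annotator distribution rather than a single voted label. One plausible estimator, sketched below (the paper's exact definition may differ), replaces the usual per-bin 0/1 correctness with the probability mass the annotators assign to the model's predicted class:

```python
import numpy as np

def true_label_ece(probs, soft_labels, n_bins=10):
    """ECE against the annotator distribution: within each confidence bin,
    compare mean confidence to the mean annotator probability of the
    predicted class, instead of accuracy against a voted label."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    # probability annotators assign to the model's predicted class
    agree = soft_labels[np.arange(len(pred)), pred]
    bins = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - agree[mask].mean())
    return ece

# Toy check: if the model's probabilities equal the annotator
# distribution exactly, this estimator is zero.
p = np.array([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]])
print(true_label_ece(p, p))  # 0.0 -- perfectly calibrated w.r.t. annotators
```

Under this reading, a model fitted to majority-vote labels can score well on conventional ECE while `true_label_ece` remains large, which is the gap the paper's methods target.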
Problem

Research questions and friction points this paper is trying to address.

confidence calibration
ambiguous ground truth
annotation disagreement
label distribution
miscalibration
Innovation

Methods, ideas, or system contributions that make the work stand out.

confidence calibration
ambiguous ground truth
annotation disagreement
proper scoring rules
post-hoc calibration
πŸ”Ž Similar Papers