Bag of Coins: A Statistical Probe into Neural Confidence Structures

📅 2025-07-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Poor confidence calibration in modern neural networks hinders their trustworthy deployment in high-stakes applications. To address this, we propose Bag-of-Coins (BoC), a non-parametric statistical probe that frames confidence assessment as a one-vs-one competitive hypothesis test, directly analyzing the internal consistency of classifier logits without retraining. BoC constructs a binary competition mechanism grounded in softmax probabilities, uncovering fundamental differences in confidence structure between Vision Transformers (ViTs) and CNNs: ViTs exhibit high intrinsic consistency, whereas CNNs such as ResNet harbor deep-seated inconsistencies overlooked by conventional metrics such as Expected Calibration Error (ECE). On ViT models, BoC achieves an ECE of 0.0212 (88% lower than temperature scaling), establishing a novel, interpretable, and training-free paradigm for model calibration diagnostics.

📝 Abstract
Modern neural networks, despite their high accuracy, often produce poorly calibrated confidence scores, limiting their reliability in high-stakes applications. Existing calibration methods typically post-process model outputs without interrogating the internal consistency of the predictions themselves. In this work, we introduce a novel, non-parametric statistical probe, the Bag-of-Coins (BoC) test, that examines the internal consistency of a classifier's logits. The BoC test reframes confidence estimation as a frequentist hypothesis test: does the model's top-ranked class win 1-v-1 contests against random competitors at a rate consistent with its own stated softmax probability? When applied to modern deep learning architectures, this simple probe reveals a fundamental dichotomy. On Vision Transformers (ViTs), the BoC output serves as a state-of-the-art confidence score, achieving near-perfect calibration with an ECE of 0.0212, an 88% improvement over a temperature-scaled baseline. Conversely, on Convolutional Neural Networks (CNNs) like ResNet, the probe reveals a deep inconsistency between the model's predictions and its internal logit structure, a property missed by traditional metrics. We posit that BoC is not merely a calibration method, but a new diagnostic tool for understanding and exposing the differing ways that popular architectures represent uncertainty.
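The abstract describes the BoC test as asking whether the top-ranked class wins 1-v-1 contests against random competitor classes at a rate matching its softmax probability. A minimal sketch of that idea follows; the function name `bag_of_coins_score` and the contest mechanics (each contest as a biased coin whose bias is the renormalized two-way softmax probability) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def bag_of_coins_score(logits, n_draws=1000, rng=None):
    """Sketch of a one-vs-one competitive probe over classifier logits.

    For the top-ranked class, repeatedly draw a random competitor class and
    flip a biased coin whose win probability is the pairwise (renormalized)
    softmax probability of the top class. The empirical win rate is returned
    as a confidence score in [0, 1].
    """
    rng = np.random.default_rng(rng)
    logits = np.asarray(logits, dtype=float)

    # Numerically stable softmax over the raw logits.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    top = int(np.argmax(probs))
    others = np.delete(np.arange(len(probs)), top)

    # Sample random competitor classes (with replacement) for each contest.
    competitors = rng.choice(others, size=n_draws)

    # Assumed contest rule: the top class wins each 1-v-1 contest with
    # probability p_top / (p_top + p_competitor).
    p_win = probs[top] / (probs[top] + probs[competitors])
    flips = rng.random(n_draws) < p_win
    return float(flips.mean())
```

Under this reading, a well-calibrated model's win rate should track its stated softmax confidence: a sharply peaked logit vector yields a score near 1, while uniform logits yield a score near 0.5.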
Problem

Research questions and friction points this paper is trying to address.

Examines neural networks' internal confidence consistency
Proposes Bag-of-Coins test for logit consistency
Reveals calibration differences between ViTs and CNNs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Non-parametric probe tests classifier logit consistency
Frequentist hypothesis test reframes confidence estimation
Reveals calibration dichotomy between ViTs and CNNs
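Both the summary and the abstract report results in terms of Expected Calibration Error (ECE). For reference, a minimal sketch of the standard equal-width-bin ECE, which the paper's numbers (e.g. 0.0212 on ViTs) presumably follow in some variant; the bin count and binning scheme here are conventional defaults, not taken from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Equal-width-bin ECE: the bin-weighted mean absolute gap between
    empirical accuracy and mean confidence within each confidence bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)

    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Weight each bin's |accuracy - confidence| gap by its mass.
            ece += mask.mean() * abs(correct[mask].mean()
                                     - confidences[mask].mean())
    return ece
```

An ECE of 0 means every confidence bin's stated confidence matches its empirical accuracy; the BoC score can be plugged in as `confidences` to compare it against a temperature-scaled softmax baseline.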