🤖 AI Summary
This work addresses a common failure of label ranking models: their predicted probabilities often do not match empirical ranking frequencies, so the models are poorly calibrated. It gives a formal treatment of calibration for label ranking, defining calibration for full rankings, sub-rankings, and top-k rankings and organizing these notions into a hierarchy. Full-rank calibration implies the other two but not conversely, while sub-ranking and top-k calibration are incomparable. Empirically, popular label ranking models are often poorly calibrated, with substantial differences between sub-ranking and top-k metrics. Applied to reward models for reinforcement learning from human feedback (RLHF), calibration correlates strongly but not perfectly with benchmark accuracy, which suggests that it captures a quality dimension beyond top-1 accuracy.
📝 Abstract
Calibration, the alignment of predicted probabilities with true outcome frequencies, is essential for reliable decision-making. While extensively studied for classification and regression, calibration has not been formally addressed for probabilistic label ranking, where the goal is to predict a distribution over orderings of a label set. Naively treating rankings as classes ignores their structure and fails to capture important modalities such as pairwise and top-k predictions. We formalize calibration for label ranking and develop a hierarchy of notions covering full rankings, sub-rankings, and top-k rankings. We prove that full-rank calibration implies the others but not conversely, and that sub-ranking and top-k calibration are incomparable. Empirically, we find that popular label ranking models are often poorly calibrated, with substantial differences between sub-ranking and top-k metrics. Applying our framework to RLHF reward models, we find that calibration correlates strongly but not perfectly with benchmark accuracy, suggesting it captures a meaningful quality dimension beyond top-1 accuracy. These findings motivate future work on understanding the downstream effects of miscalibration and developing methods to correct it.
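To make the structure mentioned in the abstract concrete, the following minimal sketch shows how a single predicted distribution over full rankings induces pairwise and top-k predictions by marginalization. It is an illustration only, not the paper's definitions or code: the function names, the three-label example, the probability values, and the choice to treat a top-k prediction as an ordered prefix are all assumptions made here.

```python
from itertools import permutations


def pairwise_marginal(dist, a, b):
    """Probability that label a is ranked before label b.

    dist maps each full ranking (a tuple of labels, best first)
    to its predicted probability.
    """
    return sum(p for ranking, p in dist.items()
               if ranking.index(a) < ranking.index(b))


def top_k_marginal(dist, k):
    """Distribution over ordered top-k prefixes induced by dist.

    Assumption: a top-k prediction is the ordered prefix of length k.
    """
    out = {}
    for ranking, p in dist.items():
        prefix = ranking[:k]
        out[prefix] = out.get(prefix, 0.0) + p
    return out


# Hypothetical predicted distribution over the 6 orderings of A, B, C.
labels = ("A", "B", "C")
probs = [0.30, 0.20, 0.15, 0.15, 0.10, 0.10]
dist = dict(zip(permutations(labels), probs))

p_a_before_b = pairwise_marginal(dist, "A", "B")
top1 = top_k_marginal(dist, 1)
```

Because every pairwise and top-k probability is a sum over full-ranking probabilities, one would expect a statement about the full-ranking distribution to carry information about these marginals, which is in the spirit of the one-directional implication stated in the abstract. Treating the six orderings as six unrelated classes would hide this relationship.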