🤖 AI Summary
This study systematically investigates whether conformal prediction (CP) can quantify aleatoric uncertainty, specifically whether CP prediction set size reflects the inherent annotation ambiguity that arises from class overlap. Using four multi-annotator datasets, we generate prediction sets via eight deep learning models combined with three CP methods, and conduct the first quantitative analysis correlating prediction set size with human annotation distributions (label entropy and inter-annotator agreement). Results reveal that most CP variants exhibit only weak correlation with human labeling diversity (mean |ρ| < 0.2); only isolated configurations achieve moderate correlation (ρ ≈ 0.4–0.5). This indicates a fundamental limitation in CP's ability to model aleatoric uncertainty. Our findings provide critical empirical evidence delineating the applicability boundaries of CP for uncertainty quantification in settings involving ambiguous or subjective annotations.
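The summary does not name the three CP methods evaluated; as a point of reference only, the sketch below shows one common variant, split conformal prediction with the 1 - softmax nonconformity score, which turns any classifier's calibrated probabilities into prediction sets. This is a minimal illustration, not necessarily one of the methods used in the study, and the function and variable names are ours.

```python
import numpy as np

def split_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction with the 1 - softmax(true class) score.

    cal_probs:  (n_cal, K) softmax outputs on a held-out calibration set
    cal_labels: (n_cal,)   true labels of the calibration instances
    test_probs: (n_test, K) softmax outputs on the test set
    Returns a boolean (n_test, K) matrix; row i marks the prediction set
    of test instance i, so row sums give the prediction set sizes.
    """
    n = len(cal_labels)
    # Nonconformity score: one minus the probability of the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Conformal quantile with the finite-sample (n + 1) correction,
    # clipped to 1.0 for very small calibration sets.
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    qhat = np.quantile(scores, q_level, method="higher")
    # A class enters the set whenever its score is below the threshold.
    return (1.0 - test_probs) <= qhat
```

Under exchangeability, sets produced this way contain the true class with probability at least 1 - alpha; the study's premise is that inherent class overlap should then surface as larger sets.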
📝 Abstract
Conformal prediction is a model-agnostic approach to generating prediction sets that cover the true class with high probability. Although prediction set size is expected to capture aleatoric uncertainty, there is little evidence that it actually does. The literature suggests that prediction set size can upper-bound aleatoric uncertainty, and that prediction sets are larger for difficult instances and smaller for easy ones, but these claims have not been validated empirically. This work investigates how effectively conformal predictors quantify aleatoric uncertainty, specifically the inherent ambiguity in datasets caused by overlapping classes. We do so by measuring the correlation between prediction set sizes and the number of distinct labels assigned by human annotators per instance, and by further assessing the similarity between prediction sets and human-provided annotations. We use three conformal prediction approaches to generate prediction sets for eight deep learning models trained on four datasets. Each dataset contains annotations from multiple human annotators (five to fifty per instance), which makes class overlap identifiable. We find that the vast majority of conformal prediction outputs exhibit very weak to weak correlation with human annotations, with only a few reaching moderate correlation. These findings call for a critical reassessment of the prediction sets generated by conformal predictors: while they provide high coverage of the true classes, their capability to capture aleatoric uncertainty remains limited.
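As a concrete illustration of the evaluation the abstract describes, the sketch below correlates prediction set sizes with annotation diversity. It assumes Spearman's ρ as the correlation measure (consistent with the ρ reported in the summary) and computes both diversity notions mentioned above: the number of distinct annotator labels per instance and the entropy of the empirical label distribution. All names here are illustrative, not taken from the paper's code.

```python
import numpy as np
from scipy.stats import spearmanr, entropy

def set_size_annotation_correlation(pred_sets, annotations, num_classes):
    """Correlate conformal prediction set sizes with annotation diversity.

    pred_sets:   boolean (n, K) matrix; row i marks instance i's prediction set
    annotations: length-n list; each entry is the list of integer labels the
                 human annotators assigned to that instance
    Returns Spearman's rho against distinct-label counts and label entropy.
    """
    set_sizes = pred_sets.sum(axis=1)

    # Diversity measure 1: number of distinct labels per instance.
    n_distinct = np.array([len(set(a)) for a in annotations])

    # Diversity measure 2: entropy of the empirical annotation distribution.
    def label_entropy(labels):
        counts = np.bincount(labels, minlength=num_classes)
        return entropy(counts / counts.sum())

    entropies = np.array([label_entropy(np.asarray(a)) for a in annotations])

    rho_distinct, _ = spearmanr(set_sizes, n_distinct)
    rho_entropy, _ = spearmanr(set_sizes, entropies)
    return rho_distinct, rho_entropy
```

In this framing, a conformal predictor that genuinely tracked aleatoric uncertainty would yield a strongly positive ρ; the paper's central finding is that ρ is mostly very weak to weak, with moderate values (ρ ≈ 0.4–0.5) only in isolated configurations.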