🤖 AI Summary
This paper addresses the loss of generalization performance and predictive reliability that active learning suffers when model uncertainty is poorly calibrated. It proposes an active learning strategy that uses estimated calibration error as the primary criterion for sample selection. Methodologically, a kernel-based calibration error estimator is built into the acquisition function, so that the unlabeled samples with the highest estimated calibration error are labeled first, before the model's own uncertainty is used. Under the covariate shift assumption, the authors show that active learning with this criterion eventually yields a bounded calibration error on both the unlabeled pool and unseen test data. Experiments across pool-based active learning settings show lower calibration and generalization error than other acquisition-function baselines. The contributions are threefold: (i) a principled integration of calibration awareness into the active learning acquisition criterion; (ii) theoretical guarantees on calibration error under covariate shift; and (iii) a practical implementation based on kernel (nonparametric) estimation.
📝 Abstract
We study the problem of actively learning a classifier with a low calibration error. One of the most popular Acquisition Functions (AFs) in pool-based Active Learning (AL) is querying by the model's uncertainty. However, we recognize that an uncalibrated uncertainty model on the unlabeled pool may significantly reduce the AF's effectiveness, leading to sub-optimal generalization and high calibration error on unseen data. Deep Neural Networks (DNNs) make this even worse, as the model uncertainty from DNNs is usually uncalibrated. Therefore, we propose a new AF that estimates calibration errors and queries the samples with the highest calibration error before leveraging DNN uncertainty. Specifically, we utilize a kernel calibration error estimator under covariate shift and formally show that AL with this AF eventually leads to a bounded calibration error on the unlabeled pool and unseen test data. Empirically, our proposed method surpasses other AF baselines by achieving lower calibration and generalization errors across pool-based AL settings.
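The acquisition rule described above can be sketched roughly as follows. This is a minimal illustration rather than the paper's estimator: it assumes a Nadaraya-Watson smoother with an RBF kernel over the labeled points as a stand-in for the kernel calibration error estimator, and it omits any covariate-shift correction. The function names (`rbf_kernel`, `estimated_calibration_gap`, `select_queries`) and the `bandwidth` parameter are hypothetical and chosen only for this example.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """RBF kernel matrix between rows of a (n, d) and rows of b (m, d)."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def estimated_calibration_gap(feats_lab, correct_lab, feats_pool, conf_pool,
                              bandwidth=1.0):
    """Per-sample |confidence - kernel-smoothed accuracy| on the pool.

    feats_lab:   (m, d) features of labeled samples
    correct_lab: (m,) 1.0 where the model was right on a labeled sample, else 0.0
    feats_pool:  (n, d) features of unlabeled pool samples
    conf_pool:   (n,) model confidence (max predicted probability) on the pool
    """
    weights = rbf_kernel(feats_pool, feats_lab, bandwidth)
    denom = weights.sum(axis=1)
    # Assumption: fall back to the mean labeled accuracy when no labeled
    # point carries meaningful kernel weight for a pool sample.
    fallback = correct_lab.mean()
    smoothed = (weights @ correct_lab) / np.maximum(denom, 1e-12)
    local_acc = np.where(denom > 1e-12, smoothed, fallback)
    return np.abs(conf_pool - local_acc)


def select_queries(gap, budget):
    """Indices of the `budget` pool samples with the largest estimated gap."""
    return np.argsort(-gap)[:budget]
```

In an AL loop under these assumptions, `select_queries` would be called once per round on the current pool, the chosen samples labeled and moved to the labeled set, and the model retrained before the gap is re-estimated. Ranking by the estimated gap rather than by raw confidence is the design choice the abstract argues for: raw confidence is only a trustworthy ranking signal where the model is already calibrated.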