🤖 AI Summary
Existing approaches to evaluating predictive uncertainty lack a unified, comparable benchmark, making it difficult to assess the accuracy and precision of predictions jointly. Method: We propose the first general-purpose evaluation framework specifically designed for epistemic uncertainty, supporting diverse output formats (point estimates, probabilistic forecasts, prediction sets, and credal sets) and introducing a configurable trade-off parameter that enables fair, application-oriented comparison across heterogeneous models, e.g., Bayesian neural networks, ensembles, evidential deep learning, and belief-function methods. By combining principles from uncertainty quantification theory with standard classification evaluation, we design a performance metric that can be tailored to the task at hand. Results: Experiments on CIFAR-10, MNIST, and CIFAR-100 show that the metric behaves consistently across model classes and discriminates among them along the accuracy–precision trade-off.
📝 Abstract
Predictions of uncertainty-aware models take diverse forms, ranging from single point estimates (often averaged over prediction samples) to predictive distributions and to set-valued or credal-set representations. We propose a novel unified evaluation framework for uncertainty-aware classifiers, applicable to a wide range of model classes, that allows users to tailor the trade-off between the accuracy and the precision of predictions via a suitably designed performance metric. This makes it possible to select the model best suited to a particular real-world application as a function of the desired trade-off. Our experiments, covering Bayesian, ensemble, evidential, deterministic, credal, and belief-function classifiers on the CIFAR-10, MNIST, and CIFAR-100 datasets, show that the metric behaves as desired.
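To make the idea of a tunable accuracy–precision metric concrete, here is a minimal Python sketch, assuming a simple additive form in which a set-valued prediction earns credit for covering the true label and pays a size penalty weighted by a user-chosen parameter. The function name `tradeoff_score`, the parameter `lam`, and this particular scoring rule are illustrative assumptions, not the paper's actual metric.

```python
import numpy as np

def tradeoff_score(pred_sets, y_true, n_classes, lam=0.5):
    """Toy accuracy-precision trade-off score for set-valued predictions
    (an illustrative sketch, not the paper's metric).

    Each prediction is a set of candidate labels: a point estimate is a
    singleton, a credal or belief-function prediction can be reduced to
    its set of plausible labels, and a probabilistic forecast can be
    thresholded into a set. A prediction earns credit when it covers the
    true label and pays a penalty that grows with its size (imprecision);
    lam in [0, 1] sets how heavily imprecision is punished.
    """
    scores = []
    for S, y in zip(pred_sets, y_true):
        covered = float(y in S)                       # accuracy term
        imprecision = (len(S) - 1) / (n_classes - 1)  # 0 = point prediction, 1 = vacuous set
        scores.append(covered - lam * imprecision)
    return float(np.mean(scores))

# Example: one confident singleton, one cautious 3-label set, one miss.
sets = [{3}, {0, 3, 7}, {1}]
print(tradeoff_score(sets, y_true=[3, 3, 2], n_classes=10, lam=0.5))
```

Under this toy rule, `lam=0` rewards pure coverage (favoring large, cautious sets), while larger `lam` favors tighter, more committal predictions; sweeping `lam` traces out the accuracy–precision trade-off along which models can be compared.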