AI Summary
In high-stakes AI decision-making, jointly ensuring statistical reliability and optimizing performance during multi-objective hyperparameter selection remains challenging. Method: This paper proposes RG-PT, a novel framework that (i) introduces a Reliability Graph (RG), a directed acyclic graph modeling reliability dependencies among hyperparameters; and (ii) tightly integrates the Bradley-Terry pairwise comparison model with false discovery rate (FDR) control to enable parallel hypothesis testing across hyperparameters within the same reliability tier, overcoming efficiency and robustness limitations of conventional sequential Pareto testing. Results: Evaluated on multiple real-world tasks, RG-PT significantly improves Pareto front quality under identical reliability constraints, achieves higher calibration accuracy, boosts validation efficiency by 3.2×, and strictly controls FDR at the pre-specified threshold.
Abstract
In sensitive application domains, multi-objective hyperparameter selection can ensure the reliability of AI models prior to deployment, while optimizing auxiliary performance metrics. The state-of-the-art Pareto Testing (PT) method guarantees statistical reliability constraints by adopting a multiple hypothesis testing framework. In PT, hyperparameters are validated one at a time, following a data-driven order determined by expected reliability levels. This paper introduces a novel framework for multi-objective hyperparameter selection that captures the interdependencies among the reliability levels of different hyperparameter configurations using a directed acyclic graph (DAG), which is termed the reliability graph (RG). The RG is constructed based on prior information and data by using the Bradley-Terry model. The proposed approach, RG-based PT (RG-PT), leverages the RG to enable the efficient, parallel testing of multiple hyperparameters at the same reliability level. By integrating False Discovery Rate (FDR) control, RG-PT ensures robust statistical reliability guarantees and is shown via experiments across diverse domains to consistently yield superior solutions for multi-objective calibration problems.
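To make the idea of testing several hyperparameters at the same reliability level under FDR control more concrete, the sketch below applies the Benjamini-Hochberg step-up rule to a batch of p-values. This is only an illustration: the abstract does not say which FDR-controlling procedure RG-PT uses, so the choice of Benjamini-Hochberg, the function name `benjamini_hochberg`, and the p-values are all assumptions made for the example.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], alpha: float) -> List[bool]:
    """Return one reject flag per hypothesis using the Benjamini-Hochberg step-up rule.

    Assumption: this stands in for the FDR-controlling step; the source does not
    specify the exact procedure used by RG-PT.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Sort hypothesis indices by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k such that p_(k) <= k * alpha / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject


# Hypothetical p-values for four hyperparameter configurations in one tier.
tier_p_values = [0.001, 0.03, 0.035, 0.20]
reliable = benjamini_hochberg(tier_p_values, alpha=0.05)
```

Here a rejected null hypothesis would correspond to a configuration being declared reliable. Note the step-up behavior: the second-smallest p-value fails its own threshold, yet it is still rejected because a larger p-value passes at a higher rank.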