🤖 AI Summary
This work addresses the challenge of certifying the safety of large language models (LLMs) by statistically verifying that their failure rate lies below a specified threshold. A key obstacle is that using LLMs as evaluators introduces noise and bias, compromising statistical validity. To overcome this, the authors propose a “noisy-but-valid” hypothesis testing framework that estimates the evaluator’s true positive rate (TPR) and false positive rate (FPR) via a small human-annotated calibration set, then adjusts the statistical decision threshold accordingly to rigorously control Type I error with limited samples. This approach is the first to systematically handle imperfect evaluators with theoretical guarantees on when noisy evaluation outperforms direct assessment. It introduces the “Oracle Gap” to quantify the cost of parameter estimation and reveals how evaluation efficacy depends on evaluator quality, data scale, and certification requirements. Experiments on Jigsaw Comment, Hate Speech, and SafeRLHF datasets demonstrate superior performance over existing black-box estimators and enable interpretable reliability diagnostics.
📝 Abstract
Reliable certification of Large Language Models (LLMs), verifying that failure rates lie below a safety threshold, is critical yet challenging. While "LLM-as-a-Judge" offers scalability, judge imperfections, noise, and bias can invalidate statistical guarantees. We introduce a "Noisy but Valid" hypothesis testing framework to address this. By leveraging a small human-labelled calibration set to estimate the judge's True Positive and False Positive Rates (TPR/FPR), we derive a variance-corrected critical threshold applied to a large judge-labelled dataset. Crucially, our framework theoretically guarantees finite-sample Type-I error control (validity) despite calibration uncertainty. This distinguishes our work from Prediction-Powered Inference (PPI), positioning our method as a diagnostic tool that explicitly models judge behavior rather than a black-box estimator. Our contributions include: (1) Theoretical Guarantees: We derive the exact conditions under which noisy testing yields higher statistical power than direct evaluation; (2) Empirical Validation: Experiments on Jigsaw Comment, Hate Speech, and SafeRLHF confirm our theory; (3) The Oracle Gap: We reveal a significant performance gap between practical methods and the theoretical "Oracle" (perfectly known judge parameters), quantifying the cost of estimation. Specifically, we provide the first systematic treatment of the imperfect-judge setting, yielding interpretable diagnostics of judge reliability and clarifying how evaluation power depends on judge quality, dataset size, and certification levels. Together, these results sharpen understanding of statistical evaluation with LLM judges, and highlight trade-offs among competing inferential tools.
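To make the pipeline concrete, here is a minimal sketch of the two-stage procedure the abstract describes: estimate TPR/FPR on a small human-labelled calibration set, de-bias the judge's flag rate via q = TPR·p + (1−p)·FPR, and widen the one-sided critical threshold to absorb calibration uncertainty. The function names, the normal approximation, and the simple delta-method-style variance term are illustrative assumptions, not the paper's exact variance-corrected threshold.

```python
import math
from statistics import NormalDist


def estimate_rates(human_labels, judge_labels):
    """Estimate judge TPR/FPR on a small human-annotated calibration set.

    Labels are 1 = failure (unsafe output), 0 = pass.
    """
    tp = sum(1 for h, j in zip(human_labels, judge_labels) if h == 1 and j == 1)
    fp = sum(1 for h, j in zip(human_labels, judge_labels) if h == 0 and j == 1)
    pos = sum(human_labels)
    neg = len(human_labels) - pos
    return tp / pos, fp / neg


def noisy_valid_certify(judge_flags, tpr, fpr, n_cal, eps=0.05, alpha=0.05):
    """One-sided certification test using only judge labels.

    H0: true failure rate p >= eps; certify safety by rejecting H0.
    The judge's flag rate q satisfies q = TPR*p + (1-p)*FPR, so the
    de-biased point estimate is p_hat = (q_hat - FPR) / (TPR - FPR).
    The threshold is widened by a crude variance term for the estimated
    (TPR, FPR) -- a stand-in for the paper's variance correction.
    """
    n = len(judge_flags)
    q_hat = sum(judge_flags) / n
    p_hat = (q_hat - fpr) / (tpr - fpr)
    # sampling variance of q_hat plus calibration uncertainty in (TPR, FPR)
    var_q = q_hat * (1 - q_hat) / n
    var_cal = (tpr * (1 - tpr) + fpr * (1 - fpr)) / n_cal
    se = math.sqrt(var_q + var_cal) / (tpr - fpr)
    z = NormalDist().inv_cdf(1 - alpha)
    # certify iff the upper (1 - alpha) confidence bound on p is below eps
    return p_hat + z * se < eps
```

Note how the calibration term `var_cal` shrinks with the calibration-set size `n_cal` but not with the judge-labelled sample size `n`: with a weak calibration set, the widened threshold can refuse certification even when the de-biased point estimate is far below `eps`, which mirrors the Oracle Gap the paper quantifies.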