🤖 AI Summary
This study addresses the limited statistical power of sequential hypothesis testing under label shift or concept shift. To overcome this challenge, the authors propose a semi-supervised testing framework based on predictive betting. The method leverages unlabeled data to construct e-statistics that maintain valid Type-I error control at any stopping time, even when the predictive model is inaccurate. As the first work to integrate predictive betting into semi-supervised hypothesis testing, the proposed approach increases statistical power when only a limited number of labeled samples is available. Empirical evaluations on both synthetic data and large language model evaluation tasks show higher power than baseline methods such as prediction-powered inference, and the gains persist even when unlabeled data are scarce or predictor quality is low.
📝 Abstract
We introduce a testing-by-betting framework that leverages predictions on unlabeled data to enhance the power of sequential hypothesis testing. Given limited samples from the joint distribution of $(X,Y)$ and additional unlabeled samples from the marginal of $X$, we ask how unlabeled data can be used to test hypotheses about the distribution of $Y$ and the conditional distribution of $Y\mid X$. We introduce an e-statistic and use it to construct a sequential test. Under standard distributional assumptions (label shift or concept shift), we establish that the test is anytime valid. Furthermore, we show that for binary data, the e-statistic has non-trivial power. Crucially, our approach retains these properties even when the underlying predictions are inaccurate. Through simulations and applications to large language model evaluation, we demonstrate power gains over baseline approaches, including prediction-powered inference. These gains persist even with relatively limited unlabeled data and when predictions have low accuracy due to weak correlation between $X$ and $Y$.
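For readers unfamiliar with testing by betting, the following is a minimal sketch of the generic idea for binary labels. It is not the e-statistic proposed in this work and does not use predictions or unlabeled data. The function name `betting_test`, the fixed bet size `lam`, and the null value `p0` are illustrative assumptions. Under $H_0: \mathbb{E}[Y] = p_0$, the wealth process is a nonnegative martingale starting at 1, so by Ville's inequality it reaches $1/\alpha$ with probability at most $\alpha$, whatever the stopping time.

```python
def betting_test(ys, p0, alpha=0.05, lam=0.5):
    """Sequentially test H0: E[Y] = p0 for binary Y with a wealth process.

    Illustrative sketch only: `lam` is a fixed (hypothetical) bet size.
    Returns (rejected, steps_used, final_wealth).
    """
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    # The bet must keep every factor 1 + lam * (y - p0) nonnegative
    # for y in {0, 1}, so that wealth can never become negative.
    if not (-1.0 / (1.0 - p0) <= lam <= 1.0 / p0):
        raise ValueError("bet size lam is out of the admissible range")

    wealth = 1.0  # initial capital
    steps = 0
    for y in ys:
        steps += 1
        # Under H0 each factor has expectation 1, so wealth is a martingale.
        wealth *= 1.0 + lam * (y - p0)
        # Reject as soon as wealth reaches 1/alpha (valid at any stopping time).
        if wealth >= 1.0 / alpha:
            return True, steps, wealth
    return False, steps, wealth


if __name__ == "__main__":
    print(betting_test([1] * 20, p0=0.5))     # strong evidence against H0
    print(betting_test([1, 0] * 50, p0=0.5))  # data consistent with H0
```

With `p0 = 0.5` and `lam = 0.5`, a stream of all ones multiplies the wealth by 1.25 at each step and crosses $1/\alpha = 20$ at step 14, whereas an alternating stream shrinks the wealth and never rejects. A positive fixed bet only gains when the true mean exceeds `p0`; practical schemes typically choose the bet adaptively from past observations.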