🤖 AI Summary
This work addresses the efficient estimation of an average loss or subgroup metric on a stratified data pool when labels or human evaluations are costly. The authors propose TS-Neyman, a method that combines Thompson sampling with classical Neyman optimal allocation. It represents uncertainty about the within-stratum variances through inverse-chi-squared posterior draws and allocates samples sequentially, using the exact one-step marginal variance reduction as the priority index, so the oracle marginal-gain structure is preserved. For any fixed finite number of strata, the allocation proportions converge almost surely to the Neyman target and the variance proxy is asymptotically optimal. Empirically, TS-Neyman's relative efficiency stays within about 15% of the oracle across the synthetic benchmarks and a CivilComments real-data replay, where it improves on equal allocation by roughly 7 to 14 percent in MSE, while plug-in greedy and two-stage plug-in approaches can degrade by over an order of magnitude under sparse pilot data.
📝 Abstract
Many model evaluation tasks reduce to estimating an average loss, error rate, or subgroup metric on a stratified pool when each label, human rating, or simulator call is costly. The precision-optimal Neyman allocation depends on within-stratum variances, which must be learned from the same observations used for estimation. We formulate this as a sequential allocation problem and use the exact one-step marginal variance reduction as the priority index. Replacing the unknown variances by independent inverse-chi-squared posterior draws yields TS-Neyman, a Thompson-sampling rule that preserves the oracle marginal-gain structure while randomizing over variance uncertainty. For any fixed finite number of strata, we prove almost-sure convergence of the TS-Neyman allocation proportions to the Neyman target, asymptotic optimality of the variance proxy, and a central limit theorem for the resulting adaptive stratified estimator. In two five-stratum budget-scaling benchmarks (one bounded-loss benchmark and one binary model-error benchmark in the spirit of Dai et al., 2023), TS-Neyman's relative efficiency stays within 5 percent of the oracle on the bounded-loss population and within about 15 percent on the binary benchmark. In an additional CivilComments real-data replay with confidence-based strata, it stays within about 8 percent of the oracle and improves on equal allocation by roughly 7 to 14 percent in MSE across budgets, while plug-in greedy and two-stage plug-in can degrade by over an order of magnitude under sparse pilots. Common-pilot warm-start and prior-sensitivity studies show that this behavior is stable under working-model and working-prior misspecification.
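The allocation rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the standard stratified variance proxy sum_h W_h^2 sigma_h^2 / n_h with no finite-population correction, so that one more sample in stratum h reduces the proxy by W_h^2 sigma_h^2 / (n_h (n_h + 1)). It also assumes a working normal model with a flat prior on each stratum mean and a scaled inverse-chi-squared prior on each variance. The function names `neyman_proportions` and `ts_neyman`, the callback `draw_loss`, and the hyperparameters `pilot`, `nu0`, and `s0_sq` are hypothetical and chosen only for illustration.

```python
import numpy as np


def neyman_proportions(weights, sigmas):
    """Oracle Neyman target: budget share proportional to W_h * sigma_h."""
    w = np.asarray(weights, dtype=float) * np.asarray(sigmas, dtype=float)
    return w / w.sum()


def ts_neyman(draw_loss, weights, budget, pilot=2, nu0=1.0, s0_sq=1.0, seed=0):
    """Sketch of a Thompson-sampling Neyman allocation (assumptions in lead-in).

    draw_loss(h, rng) returns one costly observation from stratum h.
    nu0 and s0_sq are illustrative prior degrees of freedom and prior scale.
    Returns the stratified estimate and the per-stratum sample counts.
    """
    rng = np.random.default_rng(seed)
    W = np.asarray(weights, dtype=float)
    H = len(W)
    n = np.zeros(H)   # samples taken per stratum
    s1 = np.zeros(H)  # running sum of observations
    s2 = np.zeros(H)  # running sum of squared observations

    def observe(h):
        y = draw_loss(h, rng)
        n[h] += 1.0
        s1[h] += y
        s2[h] += y * y

    # Small pilot so every stratum has a sample variance.
    for h in range(H):
        for _ in range(pilot):
            observe(h)

    for _ in range(budget - pilot * H):
        # Within-stratum sum of squared deviations, clipped for round-off.
        ss = np.maximum(s2 - s1 ** 2 / n, 0.0)
        # Posterior draw: sigma_h^2 = (nu0 * s0_sq + SS_h) / chi2(nu0 + n_h - 1).
        nu = nu0 + n - 1.0
        sigma_sq = (nu0 * s0_sq + ss) / rng.chisquare(nu)
        # One-step marginal variance reduction with the sampled variances.
        gain = W ** 2 * sigma_sq / (n * (n + 1.0))
        observe(int(np.argmax(gain)))

    estimate = float(W @ (s1 / n))
    return estimate, n.astype(int)
```

Calling `ts_neyman(lambda h, rng: rng.normal(0.0, sigmas[h]), weights, budget)` on a synthetic population with known `sigmas` lets the realized shares `n / n.sum()` be compared against `neyman_proportions(weights, sigmas)`. Replacing the posterior draw with the sample variance itself would give a plug-in greedy rule of the kind the abstract contrasts against.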