🤖 AI Summary
In unbounded A/B testing—where the stopping time is unspecified a priori—there exists a fundamental tension between enabling early stopping and ensuring late-stage detection of statistically significant effects.
Method: This paper proposes a sequential monitoring framework based on repeated significance testing. We theoretically establish that, under the unbounded setting, maintaining a strict constant significance level is infeasible but can be arbitrarily approximated. Leveraging this insight, we construct an adaptive p-value boundary that dynamically controls the family-wise Type I error rate in a data-driven manner, eliminating dependence on prespecified sample size or effect magnitude. The method integrates sequential analysis with statistical generalization bounds to ensure both high statistical power and real-time decision-making capability.
Results: Empirical evaluation demonstrates that our framework significantly outperforms classical fixed-sample and SPRT approaches in robustness, sensitivity to early signals, and power for late-stage discovery—without requiring a holdout validation set.
📝 Abstract
Requiring statistical significance at multiple interim analyses before declaring a statistically significant result for an A/B test permits less stringent significance requirements at each individual interim analysis. Requiring repeated significance competes well with methods built on assumptions about the test -- assumptions that may be impossible to evaluate a priori and may require extra data to evaluate empirically. Instead, requiring repeated significance allows the data itself to prove directly that the required results are not due to chance alone. We explain how to apply tests with repeated significance to continuously monitor unbounded tests -- tests that have no a priori bound on running time or number of observations. We show that it is impossible to maintain a constant requirement for significance for unbounded tests, but that we can come arbitrarily close to that goal.
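To make the idea concrete, here is a minimal sketch of repeated-significance monitoring for an unbounded test. It is not the paper's exact boundary: it substitutes a simple geometric alpha-spending sequence (a union bound caps the family-wise Type I error at `alpha`, and the per-look levels must shrink toward zero, illustrating why a constant requirement is impossible but can be approached by taking `gamma` close to 1) and a "significant at consecutive looks" rule for the paper's adaptive data-driven boundary. All function names and parameters (`look_alpha`, `monitor`, `gamma`, `repeats`, etc.) are illustrative assumptions.

```python
import math
import random

def look_alpha(k, alpha=0.05, gamma=0.9):
    """Per-look significance level at interim analysis k (k >= 1).

    Geometric spending: the levels sum to alpha over unboundedly many
    looks, so a union bound keeps the family-wise Type I error at most
    alpha. The levels necessarily decay toward zero -- a constant
    per-look level is impossible for unbounded tests -- but gamma near 1
    makes the decay between consecutive looks arbitrarily slow.
    """
    return alpha * (1.0 - gamma) * gamma ** (k - 1)

def two_sided_p(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_a - p_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

def monitor(p_a, p_b, n_per_look=200, max_looks=50, repeats=2, seed=0):
    """Continuously monitor a simulated A/B test, declaring significance
    only after the p-value clears the shrinking boundary at `repeats`
    consecutive looks -- the repeated-significance requirement.
    """
    rng = random.Random(seed)
    sa = sb = na = nb = 0
    streak = 0
    p_val = 1.0
    for k in range(1, max_looks + 1):
        for _ in range(n_per_look):
            sa += rng.random() < p_a  # arm A conversion
            sb += rng.random() < p_b  # arm B conversion
            na += 1
            nb += 1
        p_val = two_sided_p(sa, na, sb, nb)
        streak = streak + 1 if p_val < look_alpha(k) else 0
        if streak >= repeats:
            return {"significant": True, "look": k, "p": p_val}
    return {"significant": False, "look": max_looks, "p": p_val}
```

Because the spending sequence sums to `alpha` no matter how many looks occur, the monitor needs neither a prespecified sample size nor a known effect magnitude; a large true effect (say 0.3 vs. 0.7 conversion) is typically flagged within the first few looks.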