🤖 AI Summary
In industrial A/B experiments, weak treatment effects often leave tests underpowered. Existing methods use only binary trigger observations (i.e., whether outputs differ between the treatment and control groups) and ignore the magnitude of those differences, while fully annotating trigger intensity is prohibitively costly. This paper introduces "trigger intensity" into the A/B evaluation framework for the first time and proposes two estimation paradigms: omniscient (full knowledge) and sampling-based (partial knowledge). The authors theoretically prove that the sampling bias vanishes asymptotically as the sample size increases. The method integrates trigger identification, stratified sampling, bias analysis, and Monte Carlo simulation. In simulation, the omniscient approach reduces the standard error by as much as 85%; on real-world business data, the sampling-based approach achieves a 36.48% reduction, significantly improving estimation precision and statistical power.
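The "bias vanishes as the sample grows" claim can be illustrated with a toy Monte Carlo sketch. This is a hypothetical setup, not the paper's code: we assume a true trigger rate `p_true` and a per-trigger effect `delta`, annotate only `n` sampled observations for trigger status, and plug the sampled rate into a ratio-style estimator of the per-trigger effect. The plug-in estimator is biased, and the bias shrinks roughly like 1/n:

```python
# Hypothetical sketch (assumed parameters, not the paper's method): a plug-in
# ratio estimator built from a sampled trigger rate is biased, and the bias
# shrinks as the annotation sample size n grows.
import numpy as np

rng = np.random.default_rng(1)
p_true, delta = 0.1, 0.5        # assumed trigger rate and per-trigger effect
overall = p_true * delta        # effect diluted over all observations

biases = {}
for n in [50, 200, 800, 3200]:  # number of observations annotated for triggers
    # repeat the annotation experiment many times to measure the bias
    p_hat = rng.binomial(n, p_true, size=200_000) / n
    p_hat = np.maximum(p_hat, 1 / n)     # guard against samples with no triggers
    per_trigger = overall / p_hat        # plug-in estimate of the per-trigger effect
    biases[n] = per_trigger.mean() - delta
    print(f"n={n:5d}  bias={biases[n]:+.4f}")
```

The printed bias stays positive (a ratio estimator overshoots on average) but decays roughly in proportion to 1/n, consistent with the asymptotic claim.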
📝 Abstract
In industry, online randomized controlled experiments (a.k.a. A/B experiments) are the standard approach to measuring the impact of a causal change. These experiments are typically run with small treatment effects to limit the potential blast radius. As a result, they often lack statistical significance due to a low signal-to-noise ratio. To improve precision (i.e., reduce the standard error), we introduce the idea of trigger observations: observations where the outputs of the treatment and control models differ. We show that evaluation with full information about trigger observations (full knowledge) improves precision compared to a baseline method. However, detecting all such trigger observations is costly, so we propose a sampling-based evaluation method (partial knowledge) to reduce the cost. The randomness of sampling introduces bias into the estimated outcome. We analyze this bias theoretically and show that it is inversely proportional to the number of observations used for sampling. We also compare the proposed evaluation methods on simulated and empirical data. In simulation, evaluation with full knowledge reduces the standard error by as much as 85%. In the empirical setup, evaluation with partial knowledge reduces the standard error by 36.48%.
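The variance-reduction intuition behind trigger observations can be sketched with a small simulation. This is a hypothetical toy model, not the paper's code: we assume the treatment affects only a small triggered fraction `p` of observations, trigger status is known everywhere ("full knowledge", e.g. via counterfactual logging of both model outputs), and we compare the naive all-observations estimator against one restricted to triggers and rescaled by the trigger rate:

```python
# Hypothetical Monte Carlo sketch (assumed parameters, not the paper's code):
# restricting the comparison to trigger observations shrinks the standard
# error when the treatment changes only a small fraction of outputs.
import numpy as np

rng = np.random.default_rng(0)
# assumed: users per arm, trigger rate, per-trigger effect, outcome noise
N, p, delta, sigma = 50_000, 0.05, 0.5, 1.0

def one_run():
    trig_t = rng.random(N) < p                 # trigger status, treatment arm
    trig_c = rng.random(N) < p                 # trigger status, control arm
    y_t = rng.normal(0.0, sigma, N) + delta * trig_t  # effect only on triggers
    y_c = rng.normal(0.0, sigma, N)
    baseline = y_t.mean() - y_c.mean()         # naive: average over everyone
    # full knowledge: effect among triggers, rescaled to the whole population
    p_hat = (trig_t.mean() + trig_c.mean()) / 2
    full = p_hat * (y_t[trig_t].mean() - y_c[trig_c].mean())
    return baseline, full

runs = np.array([one_run() for _ in range(300)])
se_baseline, se_full = runs.std(axis=0)
print(f"SE baseline: {se_baseline:.4f}  SE full-knowledge: {se_full:.4f}")
```

Both estimators target the same diluted effect `p * delta`, but the non-triggered observations contribute noise without signal to the baseline, so the trigger-restricted standard error is smaller by roughly a factor of sqrt(p) in this toy model; the rarer the triggers, the larger the gain.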