🤖 AI Summary
This study addresses the challenge of high variance and low statistical power in online ranking experiments, where revenue-related metrics often exhibit heavy-tailed distributions—particularly problematic under limited traffic conditions. The authors propose a novel integration of post-stratification with the CUPED (Controlled-experiment Using Pre-Experiment Data) method, leveraging pre-experiment covariates to jointly reduce variance for heavy-tailed reward metrics. This approach significantly enhances experimental sensitivity without requiring additional traffic. Empirical deployment at ShareChat demonstrates substantial variance reduction, yielding approximately a 45% decrease in the required sample size to achieve equivalent statistical confidence. The work also provides a systematic characterization of the method’s applicability conditions and practical implementation guidelines.
📝 Abstract
Online evaluation of ranking and retrieval systems often relies on downstream monetization metrics such as app revenue or creator earnings. These metrics are typically heavy-tailed, with a small fraction of users dominating both mean and variance, leading to low statistical power and unreliable conclusions in A/B experiments, especially under limited traffic.
We present a practical framework for variance reduction in online experiments by combining post-stratification with CUPED. Our approach leverages pre-experiment covariates to improve the sensitivity of monetization experiments without requiring additional traffic. Deployed at ShareChat across ranking-driven monetization experiments, the method substantially reduces variance and improves decision stability, achieving equivalent statistical confidence with ~45% less traffic than standard metrics. We further discuss practical design choices, guardrails, and limitations, providing guidance on when post-stratification is appropriate for real-world information retrieval and recommendation systems.
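The abstract does not spell out how the two techniques are combined, so the following is only a minimal sketch of one plausible arrangement, not the authors' implementation. It assumes a single pre-experiment covariate (each user's pre-period revenue), a CUPED coefficient estimated on pooled data, and strata defined by quantiles of that same covariate. The data are synthetic, and every name and constant (`cuped_adjust`, `post_stratified_effect`, the cut points, the 0.8 and 0.1 coefficients) is hypothetical.

```python
import numpy as np


def cuped_adjust(y, x):
    """CUPED: subtract the part of y explained by the pre-experiment covariate x."""
    # theta = Cov(x, y) / Var(x), estimated on pooled (treatment + control) data.
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())


def post_stratified_effect(y, treated, strata):
    """Post-stratification: weight per-stratum treatment effects by stratum share."""
    effect = 0.0
    for s in np.unique(strata):
        in_s = strata == s
        weight = in_s.mean()  # share of all users falling in this stratum
        diff = y[in_s & treated].mean() - y[in_s & ~treated].mean()
        effect += weight * diff
    return effect


# Synthetic heavy-tailed data (assumption: lognormal revenue, not real data).
rng = np.random.default_rng(0)
n = 20_000
x = rng.lognormal(mean=0.0, sigma=1.5, size=n)  # pre-experiment revenue
treated = rng.random(n) < 0.5                   # random 50/50 assignment
y = 0.8 * x + rng.lognormal(0.0, 1.0, n) + 0.1 * treated  # in-experiment revenue

# Hypothetical strata: bottom 50%, next 40%, next 9%, top 1% of pre-period revenue.
strata = np.digitize(x, np.quantile(x, [0.5, 0.9, 0.99]))

y_cuped = cuped_adjust(y, x)
naive_effect = y[treated].mean() - y[~treated].mean()
combined_effect = post_stratified_effect(y_cuped, treated, strata)

print("variance ratio after CUPED:", y_cuped.var() / y.var())
print("naive effect:", naive_effect)
print("CUPED + post-stratified effect:", combined_effect)
```

Because the required sample size for a fixed effect size and power scales linearly with metric variance, a variance ratio near 0.55 would correspond to the roughly 45% traffic saving reported above. The ratio this sketch prints depends entirely on the synthetic data and says nothing about the ShareChat results.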