AI Summary
This paper identifies a pervasive "algorithm adaptation bias" in online A/B testing of recommender systems: because production models continuously shape user behavior distributions, new models deployed to small-traffic experimental arms are systematically underestimated, leading to frequent misidentification of the superior variant. We formally define this bias and situate it within the broader theoretical framework of evaluation bias in recommender systems. Drawing on causal inference, distribution-shift modeling, and large-scale online experimentation data, we analyze the underlying flywheel mechanism and propose an end-to-end solution spanning experimental design, effect measurement, and bias correction. Empirical results show that this bias substantially distorts small-traffic A/B test outcomes, and that applying our correction significantly improves variant identification accuracy, supporting more robust online evaluation.
Abstract
Online experiments (A/B tests) are widely regarded as the gold standard for evaluating recommender system variants and guiding launch decisions. However, a variety of biases can distort experimental results and mislead decision-making. An underexplored but critical one is the algorithm adaptation effect. This bias arises from the flywheel dynamics among production models, user data, and training pipelines: new models are evaluated on user data whose distributions are shaped by the incumbent system, or are tested only in a small treatment group. As a result, the effect of a modeling or user-experience change measured in this constrained experimental setting can diverge substantially from its true impact at full deployment. In practice, experiment results often favor the production variant serving large traffic while underestimating the test variant serving small traffic, leading teams to miss the opportunity to launch a truly winning arm or to underestimate its impact. This paper aims to raise awareness of algorithm adaptation bias, situate it within the broader landscape of RecSys evaluation biases, and motivate discussion of solutions spanning experiment design, measurement, and adjustment. We detail the mechanisms of this bias, present empirical evidence from real-world experiments, and discuss potential methods for more robust online evaluation.
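The flywheel mechanism can be illustrated with a toy simulation. The sketch below is not from the paper: the `realized_ctr` function, the `adaptation_floor` parameter, and all numeric values are illustrative assumptions. It encodes one hypothetical flywheel model, in which a variant only realizes its full base CTR in proportion to the share of interaction data generated under its own policy, so a truly better challenger measured at 5% traffic can appear worse than the incumbent at 95%.

```python
# Toy simulation of algorithm adaptation bias in a small-traffic A/B test.
# All parameters and numbers are illustrative assumptions, not results
# from the paper.
import random


def realized_ctr(base_ctr, traffic_share, adaptation_floor=0.6):
    """CTR a variant actually achieves when only `traffic_share` of the
    interaction data feeding its training pipeline was generated under
    its own policy. `adaptation_floor` (hypothetical) is the fraction of
    performance retained with no self-generated data."""
    return base_ctr * (adaptation_floor + (1 - adaptation_floor) * traffic_share)


def simulate_arm(ctr, n_users, rng):
    """Average of Bernoulli click outcomes for one experimental arm."""
    clicks = sum(rng.random() < ctr for _ in range(n_users))
    return clicks / n_users


rng = random.Random(0)
BASE = {"production": 0.10, "challenger": 0.12}   # challenger is truly better
SHARE = {"production": 0.95, "challenger": 0.05}  # small-traffic experiment

measured = {
    arm: simulate_arm(realized_ctr(BASE[arm], SHARE[arm]), 200_000, rng)
    for arm in BASE
}
full_deploy = realized_ctr(BASE["challenger"], 1.0)

print(measured)      # challenger's measured CTR trails production's
print(full_deploy)   # yet at 100% traffic the challenger would win
```

Under these assumed parameters the challenger's measured CTR (~0.074) falls below production's (~0.098) even though its full-deployment CTR (0.12) exceeds production's (0.10): the small-traffic experiment flips the launch decision, which is precisely the failure mode the abstract describes.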