Bandit Simulation for Average Reward Inference

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of performing valid statistical inference on average rewards in multi-armed bandit settings after deployment, where adaptive data collection violates the i.i.d. assumption and undermines conventional off-policy evaluation methods. To overcome this limitation, the authors propose the Bandit Simulation for Inference (BSI) framework, which fits a simulator of the bandit environment from observed data, uses it to estimate the average reward of any evaluation policy (including black-box adaptive algorithms), and propagates parameter uncertainty from the simulator into confidence interval construction. BSI requires only a mild exploration condition on the behavior policy and does not use importance weighting, which removes two key constraints of traditional approaches. Theoretical analysis establishes the asymptotic validity of its confidence intervals, and empirical results confirm that BSI maintains nominal coverage even in scenarios where standard methods fail.

📝 Abstract

Multi-armed bandit algorithms are increasingly used in online platforms, clinical trials, and social science experiments, but valid statistical inference on their performance remains an open challenge. After deploying a bandit algorithm, a natural question is whether one can construct a confidence interval for its mean reward and assess whether it reliably outperforms a baseline policy. The total reward achieved in any single bandit deployment is random, and deploying a bandit twice on the same population typically yields different reward trajectories due to stochastic rewards. Standard statistical inference methods cannot be used because bandit algorithms introduce complex dependencies in the collected data, which violate the i.i.d. assumption underlying many classical approaches. Moreover, existing inference methods for adaptively collected data apply only to estimands that do not depend on the data-collection algorithm (such as the mean reward under a fixed action). We propose Bandit Simulation for Inference (BSI), a framework that fits a simulator of the bandit environment from observed data (either on-policy or off-policy) and uses it to estimate the mean reward under any evaluation policy, including adaptive black-box algorithms. BSI formally propagates uncertainty in the estimated simulator parameters into the confidence interval construction. Furthermore, BSI requires only weak exploration assumptions on the behavior policy to be valid, and it avoids importance weighting. We prove that BSI yields asymptotically valid confidence intervals, and demonstrate empirically that it maintains nominal coverage in settings where standard off-policy evaluation methods fail.
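To make the simulate-then-infer idea concrete, here is a minimal Python sketch. It is not the authors' BSI procedure, whose details the abstract does not give. It rests on several illustrative assumptions: rewards are Bernoulli, the simulator is just the vector of per-arm sample means, the evaluation policy is a hypothetical epsilon-greedy algorithm, and parameter uncertainty is propagated with a plain parametric bootstrap plus a percentile interval. The bootstrap treats the logged pull counts as fixed, so it ignores the adaptivity of the data collection and should not be expected to carry the paper's coverage guarantees. All function names (`fit_simulator`, `epsilon_greedy_run`, `simulated_interval`) are made up for this example.

```python
import random
import statistics


def fit_simulator(logged):
    """Estimate each arm's Bernoulli mean from logged (arm, reward) pairs."""
    n_arms = max(arm for arm, _ in logged) + 1
    pulls = [0] * n_arms
    wins = [0] * n_arms
    for arm, reward in logged:
        pulls[arm] += 1
        wins[arm] += reward
    if 0 in pulls:
        # Stand-in for an exploration condition: every arm must be observed.
        raise ValueError("every arm must appear at least once in the log")
    means = [w / n for w, n in zip(wins, pulls)]
    return means, pulls


def epsilon_greedy_run(means, horizon, rng, epsilon=0.1):
    """Run a (hypothetical) epsilon-greedy policy in the simulator.

    Returns the average reward over one simulated deployment.
    """
    k = len(means)
    counts = [0] * k
    sums = [0] * k
    total = 0
    for _ in range(horizon):
        if 0 in counts or rng.random() < epsilon:
            arm = rng.randrange(k)  # explore
        else:
            arm = max(range(k), key=lambda a: sums[a] / counts[a])  # exploit
        reward = 1 if rng.random() < means[arm] else 0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total / horizon


def simulated_interval(logged, horizon, n_draws=200, n_runs=20, alpha=0.05, seed=0):
    """Point estimate and percentile interval for the policy's mean reward."""
    rng = random.Random(seed)
    means, pulls = fit_simulator(logged)

    def value(params):
        # Monte Carlo estimate of the mean reward under the given simulator.
        return statistics.fmean(
            epsilon_greedy_run(params, horizon, rng) for _ in range(n_runs)
        )

    values = []
    for _ in range(n_draws):
        # Parametric bootstrap: redraw each arm's estimated mean from the fit.
        draw = [
            sum(rng.random() < m for _ in range(n)) / n
            for m, n in zip(means, pulls)
        ]
        values.append(value(draw))
    values.sort()
    lower = values[int((alpha / 2) * n_draws)]
    upper = values[int((1 - alpha / 2) * n_draws) - 1]
    return value(means), (lower, upper)


if __name__ == "__main__":
    # Synthetic log from a uniformly random behavior policy (an assumption).
    data_rng = random.Random(1)
    true_means = [0.3, 0.6]
    logged = []
    for _ in range(500):
        arm = data_rng.randrange(len(true_means))
        logged.append((arm, int(data_rng.random() < true_means[arm])))
    point, (lower, upper) = simulated_interval(logged, horizon=100)
    print(f"estimate={point:.3f}, 95% interval=({lower:.3f}, {upper:.3f})")
```

Because the evaluation policy is only ever called inside the simulator, it can be swapped for any other algorithm, adaptive or not, without touching the interval code. This is also why no importance weights appear anywhere in the sketch.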

Problem

Research questions and friction points this paper is trying to address.

bandit algorithms

statistical inference

average reward

adaptive data collection

confidence interval

Innovation

Methods, ideas, or system contributions that make the work stand out.

bandit simulation

statistical inference

average reward