Bandits for Efficient Experimentation: Adapting to Control Group, Preferences, and Context Drifts

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses online recommendation under heterogeneous user preferences, non-stationary context distributions, and the requirement that each decision perform at least as well as a baseline policy. The problem is formulated as a linear contextual multi-armed bandit with non-stationary heteroskedastic noise. The proposed algorithm, Dri-MED, handles preference heterogeneity, context drift, and baseline constraints together by extending the MED strategy to the linear setting, with variance-aware suboptimality gap estimation and a mechanism for controlling constraint violations. Theoretical analysis establishes an instance-dependent regret bound of Õ(κ/Δ̃·d²·log T) and an expected number of constraint violations bounded by Õ(d). Empirical results suggest that the method significantly outperforms conservative baselines that ignore the drift and preference structure.

📝 Abstract

We consider a variant of the linear contextual stochastic multi-armed bandit in which the learner must provide recommendations to a group of users, each with a personalized preference vector, in the presence of context distributions that drift over time. Under practitioner-friendly assumptions, we reduce this setting to a linear bandit with stationary mean but heteroskedastic and non-stationary noise. We further study the case where the learner must ensure that the mean reward of each decision exceeds that of a baseline strategy $\boldsymbol{\pi}_0$ at each decision step. We introduce Dri-MED, an algorithm inspired by the linear version of the MED strategy and carefully adapted to handle the non-stationary heteroskedastic noise. We show that the instance-dependent regret scales as $\tilde{\mathcal{O}}\left(\frac{\kappa}{\tilde{\Delta}} d^2 \log(T)\right)$, where $\tilde{\Delta}$ is the constraint-aware sub-optimality gap subject to policy $\boldsymbol{\pi}_0$, with a variance-aware multiplicative term $\kappa$ that we carefully handle using heteroskedastic regression. We further show that Dri-MED enjoys $\tilde{\mathcal{O}}(d)$ expected constraint violations. Our numerical results suggest that Dri-MED significantly outperforms conservative baselines that ignore the drift and preference structure.
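To make the reduction to a linear bandit with heteroskedastic noise more concrete, here is a minimal Python sketch of inverse-variance weighted ridge regression, a standard way to estimate a linear reward parameter when the noise level differs across observations. It is not the Dri-MED algorithm. The function names `weighted_ridge` and `satisfies_baseline`, the regularizer `lam`, and the assumption that the per-observation variances `sigma2` are known are all illustrative choices that do not come from the paper.

```python
import numpy as np

def weighted_ridge(X, y, sigma2, lam=1.0):
    """Inverse-variance weighted ridge estimate for y = X @ theta + noise.

    Assumption (hypothetical): the noise variance of each observation,
    sigma2[i], is known. In practice it would have to be estimated.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(sigma2, dtype=float)   # noisier observations get less weight
    Xw = X * w[:, None]                         # scale each row by its weight
    V = lam * np.eye(X.shape[1]) + Xw.T @ X     # regularized weighted design matrix
    b = Xw.T @ y                                # weighted response vector
    return np.linalg.solve(V, b), V

def satisfies_baseline(theta_hat, arm, baseline_arm):
    """Plug-in check that an arm's estimated mean reward is at least the baseline's.

    This ignores estimation uncertainty, so it is only a naive stand-in
    for a constraint-aware decision rule.
    """
    return float(np.dot(arm, theta_hat)) >= float(np.dot(baseline_arm, theta_hat))
```

With a single one-dimensional arm observed twice, rewards 0 and 2, variances 1 and 3, and `lam=0`, the estimate is the inverse-variance weighted mean of the rewards, which leans toward the less noisy observation.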

Problem

Research questions and friction points this paper is trying to address.

contextual bandits

preference heterogeneity

context drift

conservative constraint

non-stationary noise

Innovation

Methods, ideas, or system contributions that make the work stand out.

non-stationary contextual bandits

heteroskedastic noise

conservative bandits