Leveraging (Biased) Information: Multi-armed Bandits with Offline Data

📅 2024-05-04
🏛️ International Conference on Machine Learning
📈 Citations: 12
Influential: 3
🤖 AI Summary
This paper studies how to safely and efficiently improve online learning in stochastic multi-armed bandits (MAB) and combinatorial MAB using offline data whose distribution may differ from that of the online rewards. To handle this mismatch, the authors propose MIN-UCB, an adaptive algorithm that discards the offline data when no meaningful bound on the shift is available, guaranteeing safety, and exploits the offline information to reduce regret when a non-trivial shift bound is given. The paper establishes the first regret lower bound for MAB with biased offline data and proves that MIN-UCB achieves tight instance-dependent and instance-independent regret bounds, improving on classical UCB. Both the theoretical analysis and numerical experiments demonstrate its robustness across shift regimes.

📝 Abstract
We leverage offline data to facilitate online learning in stochastic multi-armed bandits. The probability distributions that govern the offline data and the online rewards can be different. Without any non-trivial upper bound on their difference, we show that no non-anticipatory policy can outperform the UCB policy of Auer et al. (2002), even in the presence of offline data. In complement, we propose an online policy, MIN-UCB, which outperforms UCB when a non-trivial upper bound is given. MIN-UCB adaptively chooses to utilize the offline data when they are deemed informative, and to ignore them otherwise. MIN-UCB is shown to be tight in terms of both instance-independent and instance-dependent regret bounds. Finally, we corroborate the theoretical results with numerical experiments.
Problem

Research questions and friction points this paper is trying to address.

Leveraging offline data to enhance online learning in bandit problems.
Addressing distribution mismatch between offline data and online rewards.
Developing adaptive policies that selectively use informative offline data.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive policy MIN-UCB uses offline data only when it is deemed informative.
Generalized MIN-COMB-UCB extends the approach to the combinatorial bandit setting.
Tight regret bounds are achieved under distribution mismatch.
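The core MIN-UCB idea described above can be sketched in a few lines: build a standard online-only UCB index, build a second optimistic index that pools offline and online samples and is widened by the assumed shift bound, and use the smaller of the two. This is a minimal illustration, not the paper's exact algorithm; the function name, the Hoeffding-style confidence radii, and the pooling of offline and online samples are assumptions made for exposition.

```python
import math

def min_ucb_index(online_mean, online_n, offline_mean, offline_n,
                  shift_bound, t):
    """Hypothetical per-arm MIN-UCB-style index at round t (a sketch,
    not the paper's exact construction)."""
    # Standard UCB index from online samples only (Auer et al. 2002 style).
    ucb = online_mean + math.sqrt(2.0 * math.log(t) / online_n)

    # Offline-informed index: pooled empirical mean with a tighter
    # confidence radius from the larger sample, widened by the assumed
    # upper bound on the offline-online distribution shift.
    n = online_n + offline_n
    pooled = (online_mean * online_n + offline_mean * offline_n) / n
    aux = pooled + math.sqrt(2.0 * math.log(t) / n) + shift_bound

    # Take the smaller optimistic estimate: when the shift bound is loose,
    # aux is large, so the policy falls back to plain UCB and effectively
    # ignores the offline data; when the bound is tight, aux dominates and
    # the offline data shrinks the index.
    return min(ucb, aux)
```

With a very loose shift bound the auxiliary index never binds and the policy behaves exactly like UCB, matching the adaptivity claim above; with a tight bound and many offline samples the index shrinks, which is what yields the improved regret.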