Offline-to-Online Learning in Linear Bandits

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work studies how to combine an offline dataset with online learning in stochastic linear bandits. The proposed algorithm relies on the offline data during early rounds and shifts toward exploration as the horizon grows, balancing the contributions of the two sources. The regret analysis shows that the method is simultaneously competitive with purely online and purely offline strategies: its regret relative to the optimal action is sublinear in the number of online interactions, and its regret relative to an offline reference decreases as the number of offline samples grows. Experiments illustrate its effectiveness across a range of problem parameters.
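
For readers unfamiliar with the terminology, the standard stochastic linear bandit model and the usual notion of regret against the optimal action can be written as follows. The symbols ($\theta^\star$, $\mathcal{A}$, $a_t$, $\eta_t$, $T$) are textbook notation assumed here for illustration and are not taken from the paper; the paper's exact definitions, including its regret relative to an offline reference, may differ.

```latex
% Reward model: at round t the learner picks a_t from an action set A in R^d
% and observes a noisy linear reward with unknown parameter theta^*.
r_t = \langle \theta^\star, a_t \rangle + \eta_t, \qquad a_t \in \mathcal{A} \subset \mathbb{R}^d

% Cumulative (pseudo-)regret against the optimal action over T online rounds.
R_T = \sum_{t=1}^{T} \Bigl( \max_{a \in \mathcal{A}} \langle \theta^\star, a \rangle - \langle \theta^\star, a_t \rangle \Bigr)

% "Sublinear regret" means the average per-round loss vanishes.
\lim_{T \to \infty} \frac{R_T}{T} = 0
```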

📝 Abstract

We study online learning with an additional offline dataset in the stochastic linear bandit setting. Although this problem arises frequently in practice, the offline-to-online tradeoff remains poorly understood in structured environments. We propose a linear bandit algorithm that balances this tradeoff: it relies on offline data during early rounds, and increasingly favors exploration as the horizon grows. We establish regret bounds showing that our method is simultaneously competitive with both purely online and purely offline solutions. In particular, it achieves sublinear regret relative to the optimal action in the number of online interactions, while its regret relative to an offline reference decreases as the number of offline samples grows. Empirical results further demonstrate its effectiveness across various problem parameters.
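
To make the offline-to-online setting concrete, here is a minimal Python sketch of one simple way to use an offline dataset in a linear bandit: a standard LinUCB learner whose ridge-regression statistics are initialised from the offline samples. This is not the algorithm proposed in the paper, whose details are not given on this page. The function name `warm_start_linucb`, its parameters (`lam`, `beta`, `noise`), and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np

def warm_start_linucb(actions, theta_star, offline_X, offline_y,
                      horizon, lam=1.0, beta=1.0, noise=0.1, seed=0):
    """LinUCB with ridge statistics initialised from offline data.

    Illustrative assumption only: a generic warm-started LinUCB,
    not the method from the paper. Returns cumulative regret per round.
    """
    rng = np.random.default_rng(seed)
    d = actions.shape[1]
    # Offline part of the ridge statistics: V = lam*I + X^T X, b = X^T y.
    V = lam * np.eye(d) + offline_X.T @ offline_X
    b = offline_X.T @ offline_y
    best = np.max(actions @ theta_star)  # value of the optimal action
    regret = np.zeros(horizon)
    total = 0.0
    for t in range(horizon):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b  # ridge estimate from offline + online data
        # Exploration bonus beta * sqrt(a^T V^{-1} a) for every action.
        quad = np.einsum("ij,jk,ik->i", actions, V_inv, actions)
        bonus = beta * np.sqrt(np.maximum(quad, 0.0))
        a = actions[np.argmax(actions @ theta_hat + bonus)]
        reward = a @ theta_star + noise * rng.standard_normal()
        # Online update of the same statistics.
        V += np.outer(a, a)
        b += reward * a
        total += best - a @ theta_star
        regret[t] = total
    return regret

# Synthetic example: 20 random actions in 3 dimensions, 200 offline samples.
rng = np.random.default_rng(1)
d, K = 3, 20
actions = rng.standard_normal((K, d))
theta_star = np.array([1.0, -0.5, 0.2])
offline_X = actions[rng.integers(K, size=200)]
offline_y = offline_X @ theta_star + 0.1 * rng.standard_normal(200)
regret = warm_start_linucb(actions, theta_star, offline_X, offline_y, horizon=500)
```

In this sketch the offline samples shrink the confidence widths before the first online round, so the learner behaves greedily early on and explores only where the offline data is uninformative. Passing empty arrays of shape `(0, d)` and `(0,)` for the offline data recovers plain LinUCB, which gives a purely online baseline for comparison.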

Problem

Research questions and friction points this paper is trying to address.

offline-to-online learning

linear bandits

stochastic bandits

regret analysis

offline dataset

Innovation

Methods, ideas, or system contributions that make the work stand out.

offline-to-online learning

linear bandits

regret bounds