🤖 AI Summary
This work studies how to combine an offline dataset with online learning in stochastic linear bandits. The proposed algorithm relies on offline data during early rounds and increasingly favors exploration as the horizon grows, balancing the contributions of the two sources. Its regret bounds show that the method is simultaneously competitive with both purely online and purely offline solutions: regret relative to the optimal action is sublinear in the number of online interactions, and regret relative to an offline reference decreases as the number of offline samples grows. Experiments demonstrate the effectiveness of the approach across various problem parameters.
📝 Abstract
We study online learning with an additional offline dataset in the stochastic linear bandit setting. Although this problem arises frequently in practice, the offline-to-online tradeoff remains poorly understood in structured environments. We propose a linear bandit algorithm that balances this tradeoff: it relies on offline data during early rounds, and increasingly favors exploration as the horizon grows. We establish regret bounds showing that our method is simultaneously competitive with both purely online and purely offline solutions. In particular, it achieves sublinear regret relative to the optimal action in the number of online interactions, while its regret relative to an offline reference decreases as the number of offline samples grows. Empirical results further demonstrate its effectiveness across various problem parameters.
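To make the offline-to-online idea concrete, here is a minimal sketch of a ridge-regression UCB learner whose statistics are seeded with logged data before any online round. It is not the algorithm proposed in the paper, whose adaptive balancing rule is not specified in the abstract. It is a standard LinUCB-style baseline with a warm start. The class name `WarmStartLinUCB`, the fixed bonus scale `alpha`, the uniformly random logging policy, and the toy action set and parameter vector are all assumptions chosen for illustration.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(M, v):
    return [dot(row, v) for row in M]


class WarmStartLinUCB:
    """LinUCB-style learner; offline samples enter through the same update as online ones."""

    def __init__(self, dim, reg=1.0, alpha=1.0):
        # A_inv is the inverse of the regularized design matrix reg * I + sum_t x_t x_t^T.
        self.A_inv = [[(1.0 / reg if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim   # running sum of reward * feature vector
        self.alpha = alpha     # exploration bonus scale (hypothetical fixed value)

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of the inverse design matrix.
        Ax = mat_vec(self.A_inv, x)
        denom = 1.0 + dot(x, Ax)
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
            self.b[i] += reward * x[i]

    def theta_hat(self):
        # Ridge estimate of the unknown parameter vector.
        return mat_vec(self.A_inv, self.b)

    def select(self, actions):
        theta = self.theta_hat()

        def ucb(x):
            width = math.sqrt(max(dot(x, mat_vec(self.A_inv, x)), 0.0))
            return dot(x, theta) + self.alpha * width

        return max(range(len(actions)), key=lambda i: ucb(actions[i]))


def run(n_offline, n_online, seed=0):
    """Return cumulative online regret against the best fixed action (toy problem)."""
    rng = random.Random(seed)
    theta_star = [1.0, 0.5]                            # assumed true parameter
    actions = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]     # assumed fixed action set
    noise = 0.1
    agent = WarmStartLinUCB(dim=2)

    # Offline phase: logged samples from a uniformly random behavior policy.
    for _ in range(n_offline):
        x = rng.choice(actions)
        agent.update(x, dot(x, theta_star) + rng.gauss(0.0, noise))

    # Online phase: act optimistically, observe a noisy reward, update.
    best = max(dot(x, theta_star) for x in actions)
    regret = 0.0
    for _ in range(n_online):
        x = actions[agent.select(actions)]
        agent.update(x, dot(x, theta_star) + rng.gauss(0.0, noise))
        regret += best - dot(x, theta_star)
    return regret


if __name__ == "__main__":
    for n_off in (0, 20, 200):
        print(n_off, round(run(n_off, n_online=500), 3))
```

In this sketch, offline samples shrink the confidence widths before the first online round, so early choices lean on the logged data, while the UCB bonus still drives exploration of directions the log covers poorly. Unlike the method described in the abstract, this baseline trusts the offline data unconditionally and does not adapt how much weight it receives.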