Offline-to-Online Learning in Linear Bandits

📅 2026-06-02
🤖 AI Summary
This work addresses how to combine an offline dataset with online learning in stochastic linear bandits. The proposed algorithm relies on offline data during early rounds and increasingly favors exploration as the horizon grows, balancing the contributions of both sources. Its regret bounds show that it is simultaneously competitive with purely online and purely offline strategies: regret relative to the optimal action is sublinear in the number of online interactions, while regret relative to an offline reference decreases as the number of offline samples grows. Experiments demonstrate its effectiveness across various problem parameters.
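For readers unfamiliar with the terminology, the standard stochastic linear bandit model and its regret can be written as below. The notation is an assumption made for illustration (unknown parameter \theta^\ast, action set \mathcal{A}, zero-mean noise \eta_t, horizon T) and may differ from the paper's own symbols.

```latex
r_t = \langle \theta^\ast, a_t \rangle + \eta_t, \qquad
a^\ast \in \arg\max_{a \in \mathcal{A}} \langle \theta^\ast, a \rangle, \qquad
R_T = \sum_{t=1}^{T} \langle \theta^\ast, a^\ast - a_t \rangle .
```

Sublinear regret means R_T / T \to 0 as T grows, so the average per-round loss against the optimal action vanishes.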
📝 Abstract
We study online learning with an additional offline dataset in the stochastic linear bandit setting. Although this problem arises frequently in practice, the offline-to-online tradeoff remains poorly understood in structured environments. We propose a linear bandit algorithm that balances this tradeoff: it relies on offline data during early rounds, and increasingly favors exploration as the horizon grows. We establish regret bounds showing that our method is simultaneously competitive with both purely online and purely offline solutions. In particular, it achieves sublinear regret relative to the optimal action in the number of online interactions, while its regret relative to an offline reference decreases as the number of offline samples grows. Empirical results further demonstrate its effectiveness across various problem parameters.
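The general idea of seeding an online learner with logged data can be sketched as follows. This is a minimal illustration using a generic ridge-regression UCB agent (LinUCB-style) that is warm-started by replaying offline samples; it is not the paper's algorithm. The class name `WarmStartLinUCB`, the helper `run_online`, the constant exploration weight `beta`, and all numeric settings are hypothetical choices made for this example.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def sherman_morrison(A_inv, x):
    """Return the inverse of (A + x x^T), given the symmetric inverse A_inv."""
    Ax = matvec(A_inv, x)
    denom = 1.0 + dot(x, Ax)
    d = len(x)
    return [[A_inv[i][j] - Ax[i] * Ax[j] / denom for j in range(d)] for i in range(d)]


class WarmStartLinUCB:
    """Ridge-regression UCB agent; a hypothetical illustration, not the paper's method."""

    def __init__(self, dim, reg=1.0, beta=1.0):
        self.beta = beta  # exploration weight (assumed constant here)
        # Inverse of the regularized design matrix, initially (reg * I)^{-1}.
        self.A_inv = [[(1.0 / reg if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim  # running sum of reward * feature

    def update(self, x, reward):
        # The same update is used for offline samples and online feedback.
        self.A_inv = sherman_morrison(self.A_inv, x)
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]

    def theta_hat(self):
        return matvec(self.A_inv, self.b)

    def ucb(self, x):
        # Estimated reward plus a confidence width that shrinks as data accumulates.
        width = math.sqrt(max(dot(x, matvec(self.A_inv, x)), 0.0))
        return dot(self.theta_hat(), x) + self.beta * width

    def select(self, actions):
        return max(range(len(actions)), key=lambda k: self.ucb(actions[k]))


def run_online(agent, actions, theta_star, horizon, noise_sd, rng):
    """Play `horizon` rounds and return cumulative regret against the best action."""
    best = max(dot(theta_star, a) for a in actions)
    regret = 0.0
    for _ in range(horizon):
        x = actions[agent.select(actions)]
        mean = dot(theta_star, x)
        agent.update(x, mean + rng.gauss(0.0, noise_sd))
        regret += best - mean
    return regret


if __name__ == "__main__":
    rng = random.Random(0)
    theta_star = [1.0, 0.2]                      # assumed ground truth for the demo
    actions = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]]
    agent = WarmStartLinUCB(dim=2)
    # Offline phase: replay samples logged by a uniformly random policy.
    for _ in range(200):
        x = rng.choice(actions)
        agent.update(x, dot(theta_star, x) + rng.gauss(0.0, 0.1))
    # Online phase: continue learning from live feedback.
    print(run_online(agent, actions, theta_star, horizon=500, noise_sd=0.1, rng=rng))
```

In this sketch the offline samples narrow the confidence widths before the first online round, so early choices lean on the offline estimate, while the UCB bonus keeps exploring directions the logged data covered poorly. How the paper actually balances the two sources is not specified here and may differ.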
Problem

Research questions and friction points this paper is trying to address.

offline-to-online learning
linear bandits
stochastic bandits
regret analysis
offline dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

offline-to-online learning
linear bandits
regret bounds
exploration-exploitation tradeoff
stochastic bandits