Asymptotically Optimal Problem-Dependent Bandit Policies for Transfer Learning

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies the non-contextual multi-armed bandit problem under transfer learning: before the target task begins, the learner observes i.i.d. samples from each source distribution and knows that the distance between the $k$-th source and target distributions satisfies $d_k(\nu_k, \nu'_k) \leq L_k$. For this setting, the paper establishes, for the first time, a problem-dependent, transfer-parameterized asymptotic regret lower bound. Building upon this, the authors propose KL-UCB-Transfer, an algorithm that integrates KL-divergence-based upper confidence bounds with transfer priors, and show it achieves asymptotically optimal cumulative regret in Gaussian environments. The algorithm uses the source samples to bound the possible distributional shift and adjusts its confidence intervals accordingly. Experiments demonstrate that when source and target distributions are close, KL-UCB-Transfer significantly outperforms non-transfer baselines and tightly matches the theoretical lower bound.

📝 Abstract
We study the non-contextual multi-armed bandit problem in a transfer learning setting: before any pulls, the learner is given $N'_k$ i.i.d. samples from each source distribution $\nu'_k$, and the true target distributions $\nu_k$ lie within a known distance bound $d_k(\nu_k, \nu'_k) \leq L_k$. In this framework, we first derive a problem-dependent asymptotic lower bound on cumulative regret that extends the classical Lai-Robbins result to incorporate the transfer parameters $(d_k, L_k, N'_k)$. We then propose KL-UCB-Transfer, a simple index policy that matches this new bound in the Gaussian case. Finally, we validate our approach via simulations, showing that KL-UCB-Transfer significantly outperforms the no-prior baseline when source and target distributions are sufficiently close.
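To make the idea concrete, here is a minimal sketch of what a KL-UCB-style index with a transfer prior could look like in the known-variance Gaussian case, where the KL-UCB upper bound has a closed form. This is an illustration under assumptions, not the paper's actual algorithm: the function name, the use of a mean-shift bound `L`, and the rule of taking the minimum of the target-side and source-side upper bounds are all simplifications chosen for clarity.

```python
import math

def kl_ucb_transfer_index(mu_hat, n, t, mu_src, n_src, L, sigma=1.0):
    """Hypothetical transfer-aware index for a Gaussian arm with known
    variance sigma^2 (a sketch, not the paper's exact policy).

    For Gaussian rewards, the KL-UCB upper confidence bound reduces to
        mu_hat + sigma * sqrt(2 * log(t) / n),
    since KL(N(mu1, s^2), N(mu2, s^2)) = (mu1 - mu2)^2 / (2 s^2).
    The source samples give a second upper bound on the target mean:
    the source mean estimate, inflated by its own estimation error and
    by the known shift bound L (here read as a bound on |mu - mu_src|).
    The index is the smaller of the two upper bounds.
    """
    log_t = math.log(max(t, 2))
    # Standard Gaussian KL-UCB bound from the n target pulls of this arm.
    ucb_target = mu_hat + sigma * math.sqrt(2.0 * log_t / n)
    # Source-side bound: source estimation error plus the shift bound L.
    ucb_source = mu_src + sigma * math.sqrt(2.0 * log_t / n_src) + L
    return min(ucb_target, ucb_source)
```

With many source samples and a small shift bound, the source-side term dominates early on and the index concentrates near the source mean, which is the regime where the paper reports the largest gains over no-prior baselines; with a large `L`, the index falls back to the standard KL-UCB bound.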
Problem

Research questions and friction points this paper is trying to address.

Extends multi-armed bandit regret bounds to transfer learning settings
Develops policies using source distribution samples with known distance bounds
Proposes KL-UCB-Transfer algorithm matching derived asymptotic lower bound
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends Lai-Robbins bound with transfer parameters
Proposes KL-UCB-Transfer index policy
Policy matches new bound for Gaussian distributions
Adrien Prevost
Equipe Scool, Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 - CRIStAL, F-59000 Lille, France
Timothee Mathieu
Equipe Scool, Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 - CRIStAL, F-59000 Lille, France
Odalric-Ambrym Maillard
Inria Lille - Nord Europe
Multi-armed Bandits · Stochastic Dynamical Systems · Statistical Learning · Reinforcement Learning · Random Matrices