🤖 AI Summary
This paper studies the non-contextual multi-armed bandit problem under transfer learning: before the target task begins, the learner observes i.i.d. samples from each source distribution and knows that the distance between the $k$-th target and source distributions satisfies $d_k(\nu_k, \nu'_k) \leq L_k$. For this setting, we establish, for the first time, a problem-dependent asymptotic regret lower bound parameterized by the transfer quantities. Building upon this, we propose KL-UCB-Transfer, a novel algorithm that integrates KL-divergence-based upper confidence bounds with the transfer prior, and show that it achieves asymptotically optimal cumulative regret in Gaussian environments. The algorithm adaptively estimates distributional shifts using the source samples and tightens its confidence intervals accordingly. Experiments demonstrate that when source and target distributions are close, KL-UCB-Transfer significantly outperforms non-transfer baselines and tightly matches the theoretical lower bound.
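To make the "adjusts confidence intervals" idea concrete, here is a minimal sketch of what a transfer-informed Gaussian UCB index could look like. This is an illustrative reconstruction, not the paper's exact algorithm: the helper `kl_ucb_transfer_index` and its clipping rule are assumptions. For unit-variance Gaussians, $\mathrm{KL}(\mu, \mu') = (\mu - \mu')^2 / 2$, so the classical KL-UCB index reduces to the mean plus a $\sqrt{2 \log t / n}$ radius; the transfer prior then caps that index using the source estimate and the known shift bound $L_k$.

```python
import math

def kl_ucb_transfer_index(mean, pulls, t, src_mean, src_pulls, L, sigma=1.0):
    """Hypothetical sketch of a transfer-informed Gaussian KL-UCB index.

    For variance sigma^2 Gaussians, KL(mu, mu') = (mu - mu')^2 / (2 sigma^2),
    so the classical KL-UCB index is mean + sigma * sqrt(2 log t / pulls).
    The transfer prior adds a second upper bound: the target mean cannot
    exceed the source estimate by more than L (plus source sampling noise),
    and the index is the tighter of the two bounds.
    """
    # Classical KL-UCB radius from the learner's own target-task pulls.
    ucb = mean + sigma * math.sqrt(2.0 * math.log(max(t, 2)) / pulls)
    # Upper bound implied by the N'_k source samples and the shift bound L.
    src_ucb = src_mean + sigma * math.sqrt(2.0 * math.log(max(t, 2)) / src_pulls) + L
    return min(ucb, src_ucb)
```

With few target pulls the classical radius is wide; a tight shift bound $L_k$ and many source samples let the second term dominate, which is exactly the regime where transfer helps in the experiments.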
📝 Abstract
We study the non-contextual multi-armed bandit problem in a transfer learning setting: before any pulls, the learner is given $N'_k$ i.i.d. samples from each source distribution $\nu'_k$, and the true target distributions $\nu_k$ lie within a known distance bound $d_k(\nu_k, \nu'_k) \leq L_k$. In this framework, we first derive a problem-dependent asymptotic lower bound on cumulative regret that extends the classical Lai-Robbins result to incorporate the transfer parameters $(d_k, L_k, N'_k)$. We then propose KL-UCB-Transfer, a simple index policy that matches this new bound in the Gaussian case. Finally, we validate our approach via simulations, showing that KL-UCB-Transfer significantly outperforms the no-prior baseline when source and target distributions are sufficiently close.