Non-Stationary Latent Auto-Regressive Bandits

📅 2024-02-05
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work studies the latent-state-driven non-stationary multi-armed bandit problem, where reward means evolve according to an underlying linear auto-regressive (AR) dynamic, without requiring a pre-specified non-stationarity budget. The problem is formulated as a linear dynamical system, and the authors propose Latent AR LinUCB (LARL), an algorithm that jointly performs implicit latent-state inference, approximate steady-state Kalman filtering, and online system parameter estimation to adapt to non-stationarity in an interpretable way. Theoretically, LARL achieves sub-linear regret when the noise variance of the latent state process is sufficiently small relative to the horizon; its regret bound depends explicitly on the AR coefficients and noise variances, yielding strong interpretability. Empirically, LARL significantly outperforms state-of-the-art non-stationary bandit baselines across diverse benchmarks.

๐Ÿ“ Abstract
For the non-stationary multi-armed bandit (MAB) problem, many existing methods allow a general mechanism for the non-stationarity, but rely on a budget for the non-stationarity that is sub-linear in the total number of time steps $T$. In many real-world settings, however, the mechanism for the non-stationarity can be modeled, but there is no budget for the non-stationarity. We instead consider the non-stationary bandit problem where the reward means change due to a latent, auto-regressive (AR) state. We develop Latent AR LinUCB (LARL), an online linear contextual bandit algorithm that does not rely on the non-stationarity budget, but instead forms good predictions of reward means by implicitly predicting the latent state. The key idea is to reduce the problem to a linear dynamical system which can be solved as a linear contextual bandit. In fact, LARL approximates a steady-state Kalman filter and efficiently learns system parameters online. We provide an interpretable regret bound for LARL with respect to the level of non-stationarity in the environment. LARL achieves sub-linear regret in this setting if the noise variance of the latent state process is sufficiently small with respect to $T$. Empirically, LARL outperforms various baseline methods in this non-stationary bandit problem.
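As a concrete illustration of the reward model in the abstract, the sketch below simulates a bandit whose arm means drift with a shared AR(1) latent state. The scalar state, the loading vector `b`, and all variable names are illustrative assumptions for a minimal example, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent AR(1) bandit: all arm means drift with one latent state z_t.
K = 3                    # number of arms (assumption)
T = 500                  # horizon (assumption)
gamma = 0.95             # AR coefficient; |gamma| < 1 keeps the state stable
q, r = 0.01, 0.1         # latent-process and observation noise variances (assumed)
b = rng.normal(size=K)   # per-arm loading on the latent state (assumed)

z = 0.0                  # latent state, unobserved by the learner
rewards = np.zeros((T, K))
for t in range(T):
    z = gamma * z + rng.normal(scale=np.sqrt(q))                # state evolves
    rewards[t] = b * z + rng.normal(scale=np.sqrt(r), size=K)   # noisy rewards

# The reward means E[r_t] = b * z_t are non-stationary but structured,
# which is what lets LARL predict them by tracking the latent state.
```

A learner that ignores this structure sees arbitrary drift; one that tracks $z_t$ can predict all $K$ arm means from a single filtered estimate.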
Problem

Research questions and friction points this paper is trying to address.

Addresses non-stationary multi-armed bandit problem with latent auto-regressive state.
Develops LARL algorithm for online linear contextual bandits without non-stationary budget.
Provides sub-linear regret bound for LARL under small latent state noise variance.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses latent auto-regressive state modeling
Implements online linear contextual bandit algorithm
Approximates steady-state Kalman filter efficiently
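The steady-state Kalman filter mentioned above can be obtained by iterating the Riccati recursion to a fixed point. The sketch below does this for a scalar AR(1) state with known parameters, which is a simplifying assumption for illustration; LARL itself learns the system parameters online rather than assuming them.

```python
def steady_state_gain(gamma, q, r, iters=1000, tol=1e-12):
    """Steady-state Kalman gain for the scalar system
    z_{t+1} = gamma * z_t + w_t (Var q),  y_t = z_t + v_t (Var r).
    Illustrative sketch: iterates the Riccati recursion until the
    predicted-error variance converges, then returns the gain."""
    p = q  # initial predicted state variance
    for _ in range(iters):
        k = p / (p + r)                      # Kalman gain at this step
        p_next = gamma**2 * p * (1 - k) + q  # Riccati update of variance
        if abs(p_next - p) < tol:
            p = p_next
            break
        p = p_next
    return p / (p + r)
```

Intuitively, the gain balances trusting the AR prediction against trusting new observations: noisier observations (larger `r`) yield a smaller gain.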
🔎 Similar Papers
No similar papers found.
Anna L. Trella
Harvard University, School of Engineering and Applied Sciences, Cambridge, MA USA
Walter Dempsey
University of Michigan
exchangeability / Bayesian nonparametrics, networks, survival analysis, latent variable modeling
F. Doshi-Velez
Harvard University, School of Engineering and Applied Sciences, Cambridge, MA USA
Susan A. Murphy
Harvard University, School of Engineering and Applied Sciences, Cambridge, MA USA