🤖 AI Summary
To address the failure of classical multi-armed bandit (MAB) models when user preferences evolve over time, this paper proposes Bandits with Deterministically Evolving States (B-DES), a model in which the reward depends on an unobservable system state (such as user interests) that evolves deterministically. Methodologically, B-DES embeds the deterministic state evolution into the bandit framework and measures regret against the best fixed sequence of arms rather than the best fixed action in hindsight. It subsumes the standard MAB as a special case and supports any evolution rate λ ∈ [0,1]. Theoretical contributions include: (i) online learning algorithms covering the full range of λ; (ii) an upper bound of O(√(T(1+λT))) on the regret; and (iii) robustness of the results to various model misspecifications.
📝 Abstract
We propose a model for learning with bandit feedback while accounting for deterministically evolving and unobservable states that we call Bandits with Deterministically Evolving States ($B$-$DES$). The workhorse applications of our model are learning for recommendation systems and learning for online ads. In both cases, the reward that the algorithm obtains at each round is a function of the short-term reward of the action chosen and how "healthy" the system is (i.e., as measured by its state). For example, in recommendation systems, the reward that the platform obtains from a user's engagement with a particular type of content depends not only on the inherent features of the specific content, but also on how the user's preferences have evolved as a result of interacting with other types of content on the platform. Our general model accounts for the different rates $\lambda \in [0,1]$ at which the state evolves (e.g., how fast a user's preferences shift as a result of previous content consumption) and encompasses standard multi-armed bandits as a special case. The goal of the algorithm is to minimize a notion of regret against the best fixed sequence of arms pulled, a benchmark that is significantly harder to attain than the standard benchmark of the best fixed action in hindsight. We present online learning algorithms for any possible value of the evolution rate $\lambda$, and we show the robustness of our results to various model misspecifications.