Preferences Evolve and so Should Your Bandits: Bandits with Evolving States for Online Platforms

📅 2023-07-21

🏛️ ACM Conference on Economics and Computation

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Classical multi-armed bandit (MAB) models break down when user preferences evolve over time. To address this, the paper proposes Bandits with Deterministically Evolving States (B-DES), a model in which an unobservable, deterministically evolving system state (such as a user's interests) enters the reward function directly. B-DES embeds the state evolution into the bandit framework and measures regret against the best fixed sequence of arms rather than the best fixed action in hindsight. It subsumes the standard MAB as a special case and supports any evolution rate λ ∈ [0,1]. Theoretical contributions include: (i) online learning algorithms covering the full range of λ; (ii) a regret upper bound of O(√(T(1+λT))) against that benchmark; and (iii) robustness of the results to various model misspecifications.

📝 Abstract

We propose a model for learning with bandit feedback while accounting for deterministically evolving and unobservable states that we call Bandits with Deterministically Evolving States (B-DES). The workhorse applications of our model are learning for recommendation systems and learning for online ads. In both cases, the reward that the algorithm obtains at each round is a function of the short-term reward of the action chosen and how "healthy" the system is (i.e., as measured by its state). For example, in recommendation systems, the reward that the platform obtains from a user's engagement with a particular type of content depends not only on the inherent features of the specific content, but also on how the user's preferences have evolved as a result of interacting with other types of content on the platform. Our general model accounts for the different rate $\lambda \in [0,1]$ at which the state evolves (e.g., how fast a user's preferences shift as a result of previous content consumption) and encompasses standard multi-armed bandits as a special case. The goal of the algorithm is to minimize a notion of regret against the best fixed sequence of arms pulled, which is significantly harder to attain than the standard benchmark of the best fixed action in hindsight. We present online learning algorithms for any possible value of the evolution rate $\lambda$ and we show the robustness of our results to various model misspecifications.
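To make the benchmark concrete, here is a minimal Python sketch of a toy environment in the spirit of the abstract. The abstract does not give the exact dynamics, so everything specific here is an assumption made for illustration: the reward is taken to be the arm's short-term reward multiplied by the current state, the state moves toward an arm-specific target at rate `lam`, and the names `base_reward`, `state_effect`, and `q0` are hypothetical. Rewards are noiseless so the comparison is exact.

```python
from itertools import product


def run_sequence(arms, base_reward, state_effect, lam, q0=1.0):
    """Total reward from pulling `arms` in order.

    Assumed (illustrative) dynamics, not taken from the paper:
      reward_t = base_reward[a_t] * q_t
      q_{t+1}  = (1 - lam) * q_t + lam * state_effect[a_t]
    """
    q = q0
    total = 0.0
    for a in arms:
        total += base_reward[a] * q                  # short-term reward scaled by state
        q = (1 - lam) * q + lam * state_effect[a]    # deterministic state update
    return total


def best_fixed_arm(base_reward, state_effect, lam, horizon, q0=1.0):
    """Best total reward achievable by pulling one arm every round."""
    k = len(base_reward)
    return max(
        run_sequence([a] * horizon, base_reward, state_effect, lam, q0)
        for a in range(k)
    )


def best_fixed_sequence(base_reward, state_effect, lam, horizon, q0=1.0):
    """Best total reward over all arm sequences (brute force, tiny horizons only)."""
    k = len(base_reward)
    return max(
        run_sequence(seq, base_reward, state_effect, lam, q0)
        for seq in product(range(k), repeat=horizon)
    )


# Arm 0: high short-term reward but drives the state toward 0.
# Arm 1: low short-term reward but restores the state toward 1.
base_reward = [1.0, 0.2]
state_effect = [0.0, 1.0]

for lam in (0.0, 0.5):
    arm = best_fixed_arm(base_reward, state_effect, lam, horizon=4)
    seq = best_fixed_sequence(base_reward, state_effect, lam, horizon=4)
    print(f"lam={lam}: best fixed arm={arm}, best fixed sequence={seq}")
```

Under these assumptions, `lam = 0` freezes the state, so the best sequence is just the best single arm repeated, which matches the claim that standard bandits are a special case. With `lam > 0`, interleaving the state-restoring arm can beat any single arm, which is why the sequence benchmark is the harder one to compete with.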

Problem

Research questions and friction points this paper is trying to address.

Multi-Armed Bandit Problem

Adaptive Strategies

Regret Minimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Learning Method

Bandits with Deterministically Evolving States (B-DES)

Robustness to Model Misspecification
