From Restless to Contextual: A Thresholding Bandit Approach to Improve Finite-horizon Performance

📅 2025-02-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper studies the budget-constrained online restless bandit problem over a finite horizon, aiming to maximize long-term reward through costly interventions. Each agent is modeled as an unknown Markov decision process (MDP) with evolving states, and only counterfactual outcomes under no intervention are observable, which complicates both learning and decision-making. To address this, we reformulate the problem as a thresholded contextual bandit: a state-embedded reward design implicitly captures state transitions, enabling the learner to focus on agents whose intervention benefit exceeds a learned threshold. Theoretically, we establish, for the first time, the optimality of the oracle greedy policy in the two-state setting, and we propose the first algorithm for multi-state heterogeneous environments that relies solely on no-intervention feedback, achieving minimax-optimal constant regret. Experiments demonstrate that our method significantly outperforms existing online restless bandit algorithms in finite-horizon intervention efficacy.

📝 Abstract
Online restless bandits extend classic contextual bandits by incorporating state transitions and budget constraints, representing each agent as a Markov Decision Process (MDP). This framework is crucial for finite-horizon strategic resource allocation, optimizing limited costly interventions for long-term benefits. However, learning the underlying MDP for each agent poses a major challenge in finite-horizon settings. To facilitate learning, we reformulate the problem as a scalable budgeted thresholding contextual bandit problem, carefully integrating the state transitions into the reward design and focusing on identifying agents with action benefits exceeding a threshold. We establish the optimality of an oracle greedy solution in a simple two-state setting, and propose an algorithm that achieves minimax optimal constant regret in the online multi-state setting with heterogeneous agents and knowledge of outcomes under no intervention. We numerically show that our algorithm outperforms existing online restless bandit methods, offering significant improvements in finite-horizon performance.
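The core selection rule described in the abstract — intervene on the agents whose estimated benefit exceeds a learned threshold, subject to a per-round budget — can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the function name, the assumption that per-agent benefit estimates are already available, and the tie-breaking by largest benefit are all hypothetical.

```python
def select_interventions(benefit_estimates, threshold, budget):
    """Pick up to `budget` agents whose estimated intervention
    benefit exceeds `threshold`, highest estimates first.

    Hypothetical helper illustrating the budgeted thresholding
    idea; the real algorithm also learns the threshold and the
    benefit estimates online from no-intervention feedback.
    """
    # Keep only agents that clear the threshold.
    candidates = [i for i, b in enumerate(benefit_estimates) if b > threshold]
    # Spend the limited budget on the highest-benefit agents.
    candidates.sort(key=lambda i: benefit_estimates[i], reverse=True)
    return candidates[:budget]


# Example: four agents, threshold 0.4, budget for two interventions.
chosen = select_interventions([0.9, 0.1, 0.5, 0.7], threshold=0.4, budget=2)
# Agents 0 and 3 clear the threshold with the largest estimated benefits.
```

In the online setting, the benefit estimates and the threshold itself would be updated each round from observed outcomes, but the budgeted selection step keeps this simple top-k-above-threshold structure.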
Problem

Research questions and friction points this paper is trying to address.

Optimize finite-horizon resource allocation
Overcome MDP learning challenges
Enhance online restless bandit performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable budgeted thresholding contextual bandit
Integrates state transitions into reward design
Achieves minimax optimal constant regret
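The second bullet, folding state transitions into the reward, might look like the following in its simplest form. This is a speculative sketch of the general idea only; the function, the additive form, and the discount parameter are assumptions, not the paper's actual reward design.

```python
def state_embedded_reward(outcome, next_state_value, discount=0.9):
    """Hypothetical state-embedded reward: augment the one-step
    outcome with the (discounted) value of the agent's resulting
    state, so a contextual bandit implicitly accounts for the
    MDP transition without modeling it explicitly."""
    return outcome + discount * next_state_value


# Example: an immediate outcome of 1.0 plus a next state worth 2.0,
# discounted at 0.5, yields an embedded reward of 2.0.
r = state_embedded_reward(1.0, 2.0, discount=0.5)
```

The point of such a design is that a bandit learner comparing these augmented rewards is implicitly comparing long-term effects of intervening, which is what makes the contextual-bandit reformulation viable in the finite horizon.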