Model Predictive Control is almost Optimal for Heterogeneous Restless Multi-armed Bandits

📅 2025-11-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper studies average-reward optimization for heterogeneous restless multi-armed bandits (RMABs) over an infinite horizon. To handle arms whose dynamic model parameters are fully heterogeneous, the authors propose an LP-update policy based on model predictive control (MPC): at each step it solves a linear programming relaxation over a finite rolling horizon and uses randomized rounding to generate feasible actions. The analysis relies on dissipativity theory, which the summary describes as new to RMAB analysis, to establish performance guarantees. Theoretically, the policy achieves an optimality gap of $O(\log N / \sqrt{N})$ under full heterogeneity, improving on the classical LP-index policy. Simulations show near-optimal performance with a prediction horizon as short as five, which keeps the policy computationally cheap. The framework extends naturally to weakly coupled heterogeneous Markov decision processes.

📝 Abstract

We consider a general infinite-horizon heterogeneous restless multi-armed bandit (RMAB). Heterogeneity is a fundamental problem for many real-world systems, largely because it resists many concentration arguments. In this paper, we assume that each of the $N$ arms can have different model parameters. We show that, under a mild assumption of uniform ergodicity, a natural finite-horizon LP-update policy with randomized rounding, which was originally proposed for the homogeneous case, achieves an $O(\log N \sqrt{1/N})$ optimality gap in infinite-time average-reward problems for fully heterogeneous RMABs. In doing so, we provide strong theoretical guarantees for a well-known algorithm that works very well in practice. The LP-update policy is a model predictive approach that computes a decision at time $t$ by planning over the time horizon $\{t, \dots, t+\tau\}$. Our simulation section demonstrates that our algorithm works extremely well even when $\tau$ is as small as $5$, which makes it computationally efficient. Our theoretical results draw on techniques from the model predictive control literature by invoking the concept of dissipativity, and they generalize quite easily to the more general weakly coupled heterogeneous Markov decision process setting. In addition, we draw a parallel between our policy and the LP-index policy by showing that the LP-index policy corresponds to $\tau=1$. We describe where the latter's shortcomings come from and how, under our mild assumption, we are able to address them. The proof of our main theorem answers an open problem posed by (Brown et al., 2020), paving the way for several new questions on LP-update policies.
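For orientation, the finite-horizon LP relaxation that an LP-update policy re-solves at each time $t$ can be sketched as follows. The notation is an assumption made for illustration and is not taken from the abstract: $y_n(s,a,k)$ is the relaxed probability that arm $n$ is in state $s$ and takes action $a$ at planning step $k$, $r_n$ and $P_n$ are the arm-specific reward and transition kernel, $x_n(s)$ is the indicator of arm $n$'s current state, and $\alpha N$ is the activation budget. The paper's exact formulation may differ, for example in how the horizon is indexed or whether the budget holds with equality.

$$
\begin{aligned}
\max_{y \ge 0} \quad & \frac{1}{N} \sum_{k=0}^{\tau-1} \sum_{n=1}^{N} \sum_{s,a} r_n(s,a)\, y_n(s,a,k) \\
\text{s.t.} \quad & \sum_{a} y_n(s,a,0) = x_n(s) && \forall n, s, \\
& \sum_{a} y_n(s',a,k+1) = \sum_{s,a} P_n(s' \mid s,a)\, y_n(s,a,k) && \forall n, s', k, \\
& \sum_{n=1}^{N} \sum_{s} y_n(s,1,k) \le \alpha N && \forall k.
\end{aligned}
$$

Only the first-step variables $y_n(\cdot,\cdot,0)$ are used to choose actions; the problem is then re-solved from the newly observed states, which is what makes the approach model predictive.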

Problem

Research questions and friction points this paper is trying to address.

Analyzing heterogeneous restless multi-armed bandits with varying model parameters

Developing model predictive control policies for computationally efficient solutions

Addressing theoretical gaps in LP-update policies for infinite horizon problems

Innovation

Methods, ideas, or system contributions that make the work stand out.

Model predictive control for heterogeneous restless bandits

Finite-horizon LP-update policy with randomized rounding

Dissipativity techniques from control theory, with results that generalize to weakly coupled heterogeneous MDPs
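To make the randomized-rounding step named above concrete, here is a minimal sketch of one standard scheme, systematic sampling. It is an illustrative stand-in and not necessarily the rounding used in the paper. The function name `systematic_round` and its inputs are hypothetical: `probs[n]` is assumed to be the fractional activation probability that an LP solution assigns to arm `n` at the current step.

```python
import math
import random


def systematic_round(probs, rng=random):
    """Round fractional activation probabilities to a 0/1 action vector.

    Arm n is activated with marginal probability probs[n], and the number of
    activated arms is always floor(sum(probs)) or ceil(sum(probs)), so an
    integer budget is respected on every draw (up to float round-off in the sum).
    """
    u = rng.random()  # one uniform offset in [0, 1) shared by all arms
    actions = []
    cum = 0.0
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must lie in [0, 1]")
        lo, hi = cum, cum + p
        # Arm is activated when a point of the grid {u + k : k integer}
        # falls in (lo, hi]; since p <= 1, at most one grid point can.
        actions.append(1 if math.floor(hi - u) > math.floor(lo - u) else 0)
        cum = hi
    return actions


# Example: four arms, fractional activations summing to a budget of 2.
example_actions = systematic_round([0.5, 0.25, 0.75, 0.5], random.Random(0))
```

Because all arms share a single random offset, the draws are dependent across arms. That dependence is what keeps the number of activations fixed while each arm's marginal probability still matches the LP value.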

🔎 Similar Papers

Model Predictive Control is Almost Optimal for Restless Bandit