๐ค AI Summary
This paper addresses the restless multi-armed bandit (RMAB) resource allocation problem under non-stationary environments with unknown transition kernels, where Whittle index computation is infeasible. To tackle this, we propose the first online learning algorithm for Whittle indices: it integrates sliding-window estimation of time-varying transition probabilities, a linear optimization-based predictive model, an upper-confidence-bound (UCB) exploration mechanism, and RMAB-structured priors to accelerate convergence. We establish a sublinear dynamic regret bound. Experiments demonstrate that our method achieves significantly lower cumulative regret than state-of-the-art baselines across diverse non-stationary settings, while maintaining computational efficiency and robustness. The framework provides a scalable theoretical and practical foundation for sequential decision-making in real-world time-varying systemsโsuch as dynamic network scheduling and adaptive healthcare interventions.
๐ Abstract
We consider optimal resource allocation for restless multi-armed bandits (RMABs) in unknown, non-stationary settings. RMABs are PSPACE-hard to solve optimally, even when all parameters are known. The Whittle index policy is known to achieve asymptotic optimality for a large class of such problems, while remaining computationally efficient. In many practical settings, however, the transition kernels required to compute the Whittle index are unknown and non-stationary. In this work, we propose an online learning algorithm for Whittle indices in this setting. Our algorithm first predicts current transition kernels by solving a linear optimization problem based on upper confidence bounds and empirical transition probabilities calculated from data over a sliding window. Then, it computes the Whittle index associated with the predicted transition kernels. We design these sliding windows and upper confidence bounds to guarantee sub-linear dynamic regret on the number of episodes $T$, under the condition that transition kernels change slowly over time (rate upper bounded by $ฮต=1/T^k$ with $k>0$). Furthermore, our proposed algorithm and regret analysis are designed to exploit prior domain knowledge and structural information of the RMABs to accelerate the learning process. Numerical results validate that our algorithm achieves superior performance in terms of lowest cumulative regret relative to baselines in non-stationary environments.