🤖 AI Summary
This paper addresses the challenge of efficiently learning Whittle indices for indexable, communicating, and unichain Markov decision processes (MDPs). We propose BLINQ, a model-based method that couples model learning with Whittle index computation. BLINQ builds an empirical estimate of the MDP and applies an extended version of an existing Whittle index algorithm to it, so the indices can be learned without neural networks. We establish convergence guarantees and derive an upper bound on the time needed to learn the indices to arbitrary precision. Compared with existing Q-learning approaches, BLINQ requires substantially fewer samples and has a lower total computational cost, even when the Q-learning baselines are accelerated with pretrained neural networks. BLINQ thus provides an efficient and interpretable paradigm for restless multi-armed bandit (RMAB) problems under resource constraints.
📝 Abstract
We present BLINQ, a new model-based algorithm that learns the Whittle indices of an indexable, communicating, and unichain Markov Decision Process (MDP). Our approach relies on building an empirical estimate of the MDP and then computing its Whittle indices using an extended version of an existing state-of-the-art algorithm. We prove convergence to the Whittle indices we want to learn and bound the time needed to learn them to arbitrary precision. Moreover, we investigate the algorithm's computational complexity. Our numerical experiments suggest that BLINQ significantly outperforms existing Q-learning approaches in the number of samples needed to reach an accurate approximation. In addition, its total computational cost is lower than that of Q-learning for any reasonably large number of samples. These observations persist even when the Q-learning algorithms are sped up using pre-trained neural networks to predict Q-values.
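To make the model-based pipeline concrete, here is a minimal sketch of the two ingredients the abstract describes: estimating an MDP empirically from samples, and computing a Whittle index on the estimated model. This is *not* the BLINQ algorithm itself; the index computation below uses a generic binary search over the passive subsidy with discounted value iteration, which is a common textbook approximation (the Whittle index is usually defined under the average-reward criterion). All function names and the tiny two-state arm are illustrative assumptions.

```python
import numpy as np

def estimate_model(transitions, n_states, n_actions):
    """Empirical transition probabilities from (s, a, s') samples."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s2 in transitions:
        counts[s, a, s2] += 1
    totals = counts.sum(axis=2, keepdims=True)
    totals[totals == 0] = 1  # leave unvisited (s, a) pairs at zero
    return counts / totals

def q_values(P, R, lam, gamma=0.95, iters=2000):
    """Value iteration on the subsidy-lambda MDP.

    Action 0 (passive) earns an extra subsidy lam per step.
    P has shape (n_states, 2, n_states); R has shape (n_states, 2).
    """
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = np.empty((P.shape[0], 2))
        Q[:, 0] = R[:, 0] + lam + gamma * P[:, 0, :] @ V  # passive
        Q[:, 1] = R[:, 1] + gamma * P[:, 1, :] @ V        # active
        V = Q.max(axis=1)
    return Q

def whittle_index(P, R, s, lo=-10.0, hi=10.0, tol=1e-6):
    """Binary search for the subsidy that makes state s indifferent
    between the passive and active actions (assumes indexability)."""
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        Q = q_values(P, R, lam)
        if Q[s, 1] > Q[s, 0]:
            lo = lam  # active still preferred: subsidy too small
        else:
            hi = lam
    return 0.5 * (lo + hi)
```

As a sanity check, an arm whose active and passive actions have identical dynamics and rewards should have a Whittle index of zero in every state, since a zero subsidy already makes the two actions indifferent.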