Towards Blackwell Optimality: Bellman Optimality Is All You Can Get

πŸ“… 2025-10-15
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This paper investigates Blackwell optimality and the identification of bias-optimal policies of all orders in Markov decision processes (MDPs). To address the asymptotic nature and limited practicality of average-reward optimality, the authors propose a learning algorithm with vanishing error probability that sequentially computes *k*-order bias-optimal policies. The key contribution is a universal stopping criterion, independent of the optimality order: whenever the MDP admits a unique Bellman-optimal policy, the algorithm terminates in finite time. Combining reinforcement-learning algorithm design, statistical hypothesis testing, and Bellman-equation analysis, the method achieves asymptotically consistent identification of optimal policies of every order: as the error probability tends to zero, it identifies all policies up to Blackwell optimality with probability one, with verifiable finite-time termination whenever stopping is possible.
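
For orientation, the optimality hierarchy mentioned above is classically formalized as Veinott's *n*-discount optimality. The statement below is standard MDP background with assumed notation, not an excerpt from the paper.

```latex
% Veinott's n-discount optimality (standard background; notation assumed).
% A policy \pi is n-discount optimal if, for every policy \pi' and state s,
\liminf_{\gamma \uparrow 1}\; (1-\gamma)^{-n}
  \left[ V^{\pi}_{\gamma}(s) - V^{\pi'}_{\gamma}(s) \right] \;\ge\; 0 .
% n = -1 recovers gain (average-reward) optimality, n = 0 bias optimality,
% and a policy that is n-discount optimal for every n -- equivalently,
% discount-optimal for all gamma close enough to 1 -- is Blackwell optimal.
```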

πŸ“ Abstract
Although average gain optimality is a commonly adopted performance measure in Markov Decision Processes (MDPs), it is often too asymptotic. Further incorporating measures of immediate losses leads to the hierarchy of bias optimalities, all the way up to Blackwell optimality. In this paper, we investigate the problem of identifying policies of such optimality orders. To that end, for each order, we construct a learning algorithm with vanishing probability of error. Furthermore, we characterize the class of MDPs for which identification algorithms can stop in finite time. That class corresponds to the MDPs with a unique Bellman optimal policy, and does not depend on the optimality order considered. Lastly, we provide a tractable stopping rule that, when coupled with our learning algorithm, triggers in finite time whenever it is possible to do so.
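
For context, this hierarchy arises from the Laurent expansion of the discounted value function, a classical fact of MDP theory; the formula below is standard background with assumed notation, not quoted from the paper.

```latex
% Laurent expansion of the discounted value function (standard MDP theory).
% With \rho = (1 - \gamma)/\gamma,
V^{\pi}_{\gamma}
  \;=\; \rho^{-1}\, g^{\pi} \;+\; h^{\pi} \;+\; \sum_{n \ge 1} \rho^{n}\, v^{\pi}_{n},
% where g^{\pi} is the gain (average reward) and h^{\pi} the bias.
% Comparing policies coefficient by coefficient, starting from the gain,
% yields the hierarchy of bias optimalities; Blackwell optimality makes
% every coefficient simultaneously optimal.
```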
Problem

Research questions and friction points this paper is trying to address.

Identifying Blackwell-optimal policies in MDPs
Designing learning algorithms with vanishing error probability
Characterizing the MDPs in which optimal policies can be identified in finite time
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learning algorithm with vanishing error probability
Characterization of the MDPs admitting a unique Bellman-optimal policy
Tractable finite-time stopping rule for identification (see the sketch below)
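
The finite-time condition lends itself to a numerical check. Below is a minimal sketch, written for the discounted setting as a simplifying assumption (the paper states its condition through average-reward Bellman equations); the toy MDP, the tolerances, and the function name `unique_bellman_optimal` are illustrative, not taken from the paper.

```python
import numpy as np

def unique_bellman_optimal(P, R, gamma=0.99, tol=1e-10, atol=1e-8):
    """Check (up to numerical tolerance) whether a discounted MDP has a
    unique deterministic optimal policy.

    P: (S, A, S) transition kernel, R: (S, A) rewards. In the discounted
    setting, the optimal policies are exactly the greedy policies w.r.t.
    the optimal value function, so uniqueness amounts to the greedy
    action set being a singleton in every state."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:  # value iteration to (numerical) convergence
        Q = R + gamma * (P @ V)          # (S, A) state-action values
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    # Count near-maximizing actions per state; unique policy <=> all ones.
    greedy = (Q >= Q.max(axis=1, keepdims=True) - atol).sum(axis=1)
    return bool(np.all(greedy == 1))

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=(4, 3))   # toy MDP: 4 states, 3 actions
R = rng.uniform(size=(4, 3))
print(unique_bellman_optimal(P, R))          # generic MDPs: typically True
```

For a generic (randomly drawn) MDP the check returns True almost surely; ties, and hence non-uniqueness, only occur on the degenerate set of MDPs where two actions achieve the same optimal value in some state.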
Victor Boone
Post Doc, UniversitΓ© Grenoble Alpes
Markov decision processes Β· Bandits Β· Reinforcement learning Β· Game theory Β· Regret minimization
Adrienne Tuynman
Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189-CRIStAL, F-59000 Lille, France