Efficient Computation of Blackwell Optimal Policies using Rational Functions

📅 2025-08-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Computing Blackwell-optimal policies for Markov decision processes (MDPs) is difficult: existing algorithms are computationally expensive or hard to implement. Method: The paper builds on an ordering of rational functions in the vicinity of 1, which captures the trade-off between short-term and long-term rewards. State-of-the-art algorithms for deterministic and general MDPs are adapted by replacing numerical evaluations with symbolic operations on rational functions, which yields bounds independent of bit complexity. Contribution/Results: The paper gives the first strongly polynomial-time algorithms for computing Blackwell-optimal policies in deterministic MDPs and the first subexponential-time algorithm for general MDPs. It also generalises several policy iteration algorithms, extending the best known upper bounds from the discounted criterion to the Blackwell criterion.

📝 Abstract

Markov Decision Problems (MDPs) provide a foundational framework for modelling sequential decision-making across diverse domains, guided by optimality criteria such as discounted and average rewards. However, these criteria have inherent limitations: discounted optimality may overly prioritise short-term rewards, while average optimality relies on strong structural assumptions. Blackwell optimality addresses these challenges, offering a robust and comprehensive criterion that ensures optimality under both discounted and average reward frameworks. Despite its theoretical appeal, existing algorithms for computing Blackwell Optimal (BO) policies are computationally expensive or hard to implement. In this paper we describe procedures for computing BO policies using an ordering of rational functions in the vicinity of $1$. We adapt state-of-the-art algorithms for deterministic and general MDPs, replacing numerical evaluations with symbolic operations on rational functions to derive bounds independent of bit complexity. For deterministic MDPs, we give the first strongly polynomial-time algorithms for computing BO policies, and for general MDPs we obtain the first subexponential-time algorithm. We further generalise several policy iteration algorithms, extending the best known upper bounds from the discounted to the Blackwell criterion.
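
For reference, the standard definitions behind the abstract can be written out as below. This is a sketch for a finite MDP; the symbols $P_\pi$ (transition matrix of a stationary policy $\pi$), $r_\pi$ (its reward vector), $v_\pi^{\gamma}$ (its discounted value) and $\bar\gamma$ are notation assumed here, not taken from the paper.

```latex
\begin{align*}
v_\pi^{\gamma} &= \sum_{t \ge 0} \gamma^{t} P_\pi^{t} r_\pi
               = (I - \gamma P_\pi)^{-1} r_\pi,
               \qquad 0 \le \gamma < 1, \\
\pi^{*} \text{ is Blackwell optimal}
  &\iff \exists\, \bar\gamma \in [0, 1) :\quad
        v_{\pi^{*}}^{\gamma}(s) \ge v_{\pi}^{\gamma}(s)
        \quad \forall s,\ \forall \pi,\ \forall \gamma \in [\bar\gamma, 1).
\end{align*}
```

By Cramer's rule, each entry of $(I - \gamma P_\pi)^{-1} r_\pi$ is a rational function of $\gamma$, so comparing two policies for all $\gamma$ close enough to $1$ amounts to comparing two rational functions in the vicinity of $1$.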

Problem

Research questions and friction points this paper is trying to address.

Computing Blackwell optimal policies is computationally expensive

Existing algorithms are hard to implement practically

Need efficient methods for deterministic and general MDPs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Rational functions replace numerical evaluations

Symbolic operations enable bit complexity independence

Strongly polynomial-time algorithms for deterministic MDPs
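
The idea of ordering rational functions near 1 can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the function names (`poly_mul`, `poly_sub`, `sign_left_of_one`, `compare_near_one`), the coefficient-list representation, and the toy policy values are all assumptions made for illustration. The sketch decides, with exact arithmetic only, whether one rational function of the discount factor exceeds another for every discount factor just below 1.

```python
from fractions import Fraction
from math import comb

# A polynomial is a list of coefficients, lowest degree first.
# A rational function is a pair (numerator, denominator) of polynomials.

def poly_mul(a, b):
    """Product of two polynomials."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def poly_sub(a, b):
    """Difference a - b of two polynomials."""
    n = max(len(a), len(b))
    a = list(a) + [0] * (n - len(a))
    b = list(b) + [0] * (n - len(b))
    return [Fraction(x) - Fraction(y) for x, y in zip(a, b)]

def sign_left_of_one(p):
    """Sign of p(g) for every g slightly below 1 (0 if p is identically zero).

    Re-expand p in powers of (g - 1): c_k = sum_j a_j * C(j, k).
    The first nonzero c_k dominates near 1, and (g - 1)^k is negative
    for odd k when g < 1, which flips the sign.
    """
    for k in range(len(p)):
        c = sum(Fraction(a) * comb(j, k) for j, a in enumerate(p))
        if c != 0:
            s = 1 if c > 0 else -1
            return s if k % 2 == 0 else -s
    return 0

def compare_near_one(f, g):
    """Return +1, 0 or -1 as f - g is positive, zero or negative just below 1."""
    (fn, fd), (gn, gd) = f, g
    num = poly_sub(poly_mul(fn, gd), poly_mul(gn, fd))  # numerator of f - g
    den = poly_mul(fd, gd)                              # denominator of f - g
    return sign_left_of_one(num) * sign_left_of_one(den)

# Toy discounted values (hypothetical, for illustration only):
lump_sum = ([2], [1])       # reward 2 now, then nothing: value 2
stream = ([1], [1, -1])     # reward 1 at every step: value 1 / (1 - g)
now = ([1], [1])            # reward 1 now, then nothing: value 1
later = ([0, 1], [1])       # reward 1 one step later: value g

print(compare_near_one(stream, lump_sum))
print(compare_near_one(now, later))
```

Because the comparison uses only exact operations on coefficients, no discount factor is ever chosen numerically. The second toy pair has the same total reward and differs only in timing, yet the expansion around 1 still separates the two values.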

🔎 Similar Papers

Doubly Optimal Policy Evaluation for Reinforcement Learning