Howard's Policy Iteration is Subexponential for Deterministic Markov Decision Problems with Rewards of Fixed Bit-size and Arbitrary Discount Factor

📅 2025-05-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the long-standing exponential upper-bound barrier on the runtime of Howard's policy iteration (HPI) for deterministic Markov decision processes (DMDPs). Under the setting of fixed reward bit-width and arbitrary discount factor, we establish the first discount-factor-independent subexponential upper bound, namely exp(O(√n · log n)), depending solely on the number of states n and the reward bit-width, substantially improving upon the previously tightest exponential bounds. Our approach integrates structural modeling of policy graphs, combinatorial optimization analysis, bit-complexity theory, and DMDP reduction techniques. Notably, this bound applies even to the special case with only two distinct reward values of arbitrary magnitude. To date, this represents the strongest theoretical guarantee on the convergence rate of HPI.

📝 Abstract
Howard's Policy Iteration (HPI) is a classic algorithm for solving Markov Decision Problems (MDPs). HPI uses a "greedy" switching rule to update from any non-optimal policy to a dominating one, iterating until an optimal policy is found. Despite its introduction over 60 years ago, the best-known upper bounds on HPI's running time remain exponential in the number of states -- indeed even on the restricted class of MDPs with only deterministic transitions (DMDPs). Meanwhile, the tightest lower bound for HPI for MDPs with a constant number of actions per state is only linear. In this paper, we report a significant improvement: a subexponential upper bound for HPI on DMDPs, which is parameterised by the bit-size of the rewards, while independent of the discount factor. The same upper bound also applies to DMDPs with only two possible rewards (which may be of arbitrary size).
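For readers unfamiliar with the algorithm under analysis, the loop below is a minimal sketch of Howard's Policy Iteration on a DMDP: evaluate the current policy, then greedily switch every state whose best outgoing edge improves on the current choice. The DMDP encoding (adjacency lists of `(next_state, reward)` edges), the iterative evaluation routine, and the toy instance are all illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch of Howard's Policy Iteration (HPI) on a deterministic
# MDP (DMDP). The data layout and the toy instance below are hypothetical.

def evaluate(policy, edges, gamma, iters=10000, tol=1e-12):
    """Fixed-point policy evaluation: V[s] = r + gamma * V[next].
    (An exact linear solve would also work; iteration suffices for a sketch.)"""
    n = len(edges)
    V = [0.0] * n
    for _ in range(iters):
        new_V = []
        delta = 0.0
        for s in range(n):
            nxt, r = edges[s][policy[s]]
            v = r + gamma * V[nxt]
            new_V.append(v)
            delta = max(delta, abs(v - V[s]))
        V = new_V
        if delta < tol:
            break
    return V

def howard_pi(edges, gamma):
    """edges[s] is a list of (next_state, reward) pairs; policy[s] indexes it."""
    n = len(edges)
    policy = [0] * n  # arbitrary initial policy
    while True:
        V = evaluate(policy, edges, gamma)
        new_policy = list(policy)
        for s in range(n):
            q = lambda a: edges[s][a][1] + gamma * V[edges[s][a][0]]
            best = max(range(len(edges[s])), key=q)
            # Greedy switching rule: change the action only on strict improvement.
            if q(best) > q(policy[s]) + 1e-9:
                new_policy[s] = best
        if new_policy == policy:  # no improvable state => policy is optimal
            return policy, V
        policy = new_policy

# Toy DMDP with only two distinct reward values (0 and 1), echoing the
# two-reward special case the paper covers. Three states, deterministic edges.
edges = [
    [(1, 0.0), (2, 1.0)],  # state 0: go to 1 (reward 0) or to 2 (reward 1)
    [(2, 0.0)],            # state 1: only edge, to 2 (reward 0)
    [(0, 1.0), (1, 0.0)],  # state 2: go to 0 (reward 1) or to 1 (reward 0)
]
policy, V = howard_pi(edges, gamma=0.9)
# Optimal policy cycles 0 -> 2 -> 0 collecting reward 1 each step,
# so V[0] = V[2] = 1 / (1 - 0.9) = 10.
```

Note that HPI switches *all* improvable states in one iteration; the paper's bound concerns how many such iterations can occur before reaching a fixed point.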
Problem

Research questions and friction points this paper is trying to address.

Improving Howard's Policy Iteration runtime for deterministic MDPs
Establishing subexponential bound independent of discount factor
Analyzing HPI performance with fixed bit-size rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Subexponential upper bound for HPI on DMDPs
Parameterised by bit-size of rewards
Independent of discount factor
Dibyangshu Mukherjee
Department of Computer Science and Engineering, IIT Bombay, Mumbai, India
Shivaram Kalyanakrishnan
Indian Institute of Technology Bombay
Artificial Intelligence · Machine Learning