Reliable Critics: Monotonic Improvement and Convergence Guarantees for Reinforcement Learning

📅 2025-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
In reinforcement learning, function approximation compromises the monotonic improvement and convergence guarantees of policy iteration. To address this, we propose Reliable Policy Iteration (RPI), which replaces conventional projection-based or Bellman-error-minimization approaches with Bellman-constrained optimization. Under linear function approximation, RPI is the first method to rigorously guarantee monotonic value estimation improvement, lower-bounded value estimates, and partial satisfaction of the projection-free Bellman equation. RPI constructs the first model-free critic with both theoretical reliability and practical applicability, seamlessly integrating into mainstream algorithms such as DQN and DDPG. Empirically, on classical control benchmarks, RPI-enhanced agents consistently maintain provable return lower bounds while matching or exceeding the performance of all baselines. Thus, RPI provides the first scalable and theoretically sound solution to policy iteration under function approximation.

📝 Abstract
Despite decades of research, it remains challenging to correctly use Reinforcement Learning (RL) algorithms with function approximation. A prime example is policy iteration, whose fundamental guarantee of monotonic improvement collapses even under linear function approximation. To address this issue, we introduce Reliable Policy Iteration (RPI). It replaces the common projection or Bellman-error minimization during policy evaluation with a Bellman-based constrained optimization. We prove that not only does RPI confer textbook monotonicity on its value estimates but these estimates also lower bound the true return. Also, their limit partially satisfies the unprojected Bellman equation, emphasizing RPI's natural fit within RL. RPI is the first algorithm with such monotonicity and convergence guarantees under function approximation. For practical use, we provide a model-free variant of RPI that amounts to a novel critic. It can be readily integrated into primary model-free PI implementations such as DQN and DDPG. In classical control tasks, such RPI-enhanced variants consistently maintain their lower-bound guarantee while matching or surpassing the performance of all baseline methods.
Problem

Research questions and friction points this paper is trying to address.

Ensuring monotonic improvement in RL with function approximation
Addressing policy iteration collapse under linear approximation
Providing convergence guarantees for RL algorithms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bellman-based constrained optimization replaces projection
Monotonic improvement guarantees under function approximation
Model-free variant integrates with DQN and DDPG
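The core idea summarized above — replacing projection or Bellman-error minimization with a Bellman-based constrained optimization whose solutions lower-bound the true return — can be sketched as a small linear program under linear function approximation. This is an illustrative reconstruction, not the paper's exact formulation: the feature matrix, the box bounds on the weights, and the LP objective here are assumptions made for the sketch. It relies only on the standard fact that any `V` satisfying `V <= T^pi V` elementwise also satisfies `V <= V^pi`.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative sketch (NOT the paper's exact method): constrain the
# approximate value V = Phi @ w to satisfy the Bellman inequality
# V <= r + gamma * P @ V elementwise; by monotonicity of the Bellman
# operator, any feasible V lower-bounds the true value V^pi.
rng = np.random.default_rng(0)
n_states, n_feats, gamma = 6, 3, 0.9

P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)      # transition matrix of a fixed policy
r = rng.random(n_states)               # per-state reward under that policy
Phi = rng.random((n_states, n_feats))  # assumed feature matrix

# True value of the fixed policy: V^pi = (I - gamma P)^{-1} r
V_true = np.linalg.solve(np.eye(n_states) - gamma * P, r)

# LP: maximize sum(Phi @ w)  subject to  (I - gamma P) Phi w <= r.
# The box bounds on w are a safeguard against unboundedness in this toy LP.
A_ub = (np.eye(n_states) - gamma * P) @ Phi
res = linprog(c=-Phi.sum(axis=0), A_ub=A_ub, b_ub=r,
              bounds=[(-100.0, 100.0)] * n_feats)
V_hat = Phi @ res.x

print(np.all(V_hat <= V_true + 1e-6))  # lower-bound property holds
```

The lower-bound guarantee holds for *any* feasible point of this program, which is what makes the estimates "reliable": even an inexact solve cannot overestimate the return. The actual RPI critic and its DQN/DDPG integration are of course more involved than this one-shot LP.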
Eshwar S. R.
Computer Science and Automation, Indian Institute of Science, Bengaluru, India 560012
Gugan Thoppe
Computer Science and Automation, Indian Institute of Science, Bengaluru, India 560012
Aditya Gopalan
Electrical and Communication Engineering, Indian Institute of Science, Bengaluru, India 560012
Gal Dalal
Sr. Research Scientist, Nvidia
Reinforcement Learning · Machine Learning · Power Systems · Optimization