StaQ it! Growing neural networks for Policy Mirror Descent

📅 2025-06-16

🤖 AI Summary

In reinforcement learning, Policy Mirror Descent (PMD) provides theoretical guarantees for regularized policy optimization, but its closed-form update involves the sum of all past Q-functions, which is intractable in deep RL. This paper proposes and analyzes PMD-like algorithms that keep only the last $M$ Q-functions in memory. For finite and large enough $M$, the authors derive a convergent algorithm that introduces no error in the policy update, unlike prior deep RL implementations of PMD. The resulting algorithm, StaQ, is competitive with deep RL baselines while exhibiting less performance oscillation, and it provides a testbed for experimentation with PMD.

📝 Abstract

In Reinforcement Learning (RL), regularization has emerged as a popular tool both in theory and practice, typically based either on an entropy bonus or a Kullback-Leibler divergence that constrains successive policies. In practice, these approaches have been shown to improve exploration, robustness and stability, giving rise to popular Deep RL algorithms such as SAC and TRPO. Policy Mirror Descent (PMD) is a theoretical framework that solves this general regularized policy optimization problem; however, the closed-form solution involves the sum of all past Q-functions, which is intractable in practice. We propose and analyze PMD-like algorithms that only keep the last $M$ Q-functions in memory, and show that for finite and large enough $M$, a convergent algorithm can be derived, introducing no error in the policy update, unlike prior deep RL PMD implementations. StaQ, the resulting algorithm, enjoys strong theoretical guarantees and is competitive with deep RL baselines, while exhibiting less performance oscillation, paving the way for fully stable deep RL algorithms and providing a testbed for experimentation with Policy Mirror Descent.
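
To make the memory issue concrete, here is a minimal tabular sketch of a softmax policy whose logits are the scaled sum of stored Q-tables, with a window that holds only the last $M$ of them. It assumes the KL-only form of the update, in which each new policy is proportional to the previous one times `exp(Q / eta)`; the abstract does not state the exact update, so this form may differ from the paper's regularizer. The class name `WindowedSoftmaxPolicy`, the step-size parameter `eta`, and the plain eviction rule are all hypothetical. In particular, simply dropping the oldest table is naive truncation and is not claimed to be how StaQ achieves an error-free policy update.

```python
from collections import deque

import numpy as np


class WindowedSoftmaxPolicy:
    """Tabular softmax policy over the sum of the last `window` Q-tables.

    Illustrative only: the eviction rule here is plain truncation, an
    assumption of this sketch, not the update analyzed in the paper.
    """

    def __init__(self, n_states, n_actions, window, eta=1.0):
        # deque(maxlen=...) silently discards the oldest entry when full.
        self.q_window = deque(maxlen=window)
        self.eta = eta  # hypothetical step-size / temperature parameter
        self.shape = (n_states, n_actions)

    def push(self, q_table):
        """Store the newest Q-table, evicting the oldest if the window is full."""
        q_table = np.asarray(q_table, dtype=float)
        if q_table.shape != self.shape:
            raise ValueError(f"expected shape {self.shape}, got {q_table.shape}")
        self.q_window.append(q_table)

    def probs(self):
        """Return action probabilities, one row per state."""
        logits = np.zeros(self.shape)
        for q in self.q_window:
            logits += q / self.eta
        # Subtract the row-wise max before exponentiating for numerical stability.
        logits -= logits.max(axis=1, keepdims=True)
        unnormalized = np.exp(logits)
        return unnormalized / unnormalized.sum(axis=1, keepdims=True)
```

With an empty window the policy is uniform. Memory use stays bounded by `window` tables no matter how many times `push` is called, whereas the exact closed-form sum would require every past Q-function.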

Problem

Research questions and friction points this paper is trying to address.

Overcoming the intractable sum of past Q-functions in PMD

Reducing performance oscillation in deep RL algorithms

Enabling stable Policy Mirror Descent implementations

Innovation

Methods, ideas, or system contributions that make the work stand out.

PMD-like algorithms that keep only the last $M$ Q-functions

A convergent algorithm for finite, large enough $M$

StaQ: stable deep RL with PMD

🔎 Similar Papers

Functional Acceleration for Policy Mirror Descent