🤖 AI Summary
Reinforcement learning (RL) lacks flexible, online adjustment of policy behavior at test time without retraining. Method: We propose Policy Gradient Guidance (PGG), the first approach to incorporate classifier-free guidance, originally developed for diffusion models, into classical policy gradient frameworks. PGG augments the policy network with an unconditional branch and enables real-time behavioral modulation via advantage-weighted interpolation, requiring no fine-tuning. Contribution/Results: We show theoretically that, under advantage normalization, the guidance update admits a closed-form, analytically tractable expression. Practically, PGG combines conditional dropout with advantage-based guidance, significantly improving sample efficiency and policy stability across both discrete and continuous control tasks. Extensive experiments on diverse RL benchmarks demonstrate its effectiveness, offering an interpretable, controllable, and training-free pathway to online policy adaptation in RL.
📝 Abstract
We introduce Policy Gradient Guidance (PGG), a simple extension of classifier-free guidance from diffusion models to classical policy gradient methods. PGG augments the policy with an unconditional branch and interpolates between the conditional and unconditional branches, yielding a test-time control knob that modulates behavior without retraining. We provide a theoretical derivation showing that the additional normalization term vanishes under advantage estimation, leading to a clean guided policy gradient update. Empirically, we evaluate PGG on discrete and continuous control benchmarks. We find that conditioning dropout, a component central to diffusion guidance, offers gains in simple discrete tasks and low-sample regimes, but destabilizes continuous control. Training with modestly larger guidance ($\gamma > 1$) consistently improves stability, sample efficiency, and controllability. Our results show that guidance, previously confined to diffusion policies, can be adapted to standard on-policy methods, opening new directions for controllable online reinforcement learning.
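The guidance mechanism described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact parameterization: it assumes a discrete-action policy represented by logits, and applies the classifier-free-guidance-style interpolation `uncond + gamma * (cond - uncond)` before the softmax, so that `gamma = 1` recovers the conditional policy and `gamma > 1` extrapolates away from the unconditional branch. The function and variable names here are illustrative, not from the paper.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def guided_policy(cond_logits, uncond_logits, gamma):
    """Interpolate conditional and unconditional branches in logit space.

    gamma = 1 recovers the conditional policy; gamma > 1 pushes the
    distribution further from the unconditional branch, acting as a
    test-time control knob (hypothetical sketch of the mechanism).
    """
    guided = uncond_logits + gamma * (cond_logits - uncond_logits)
    return softmax(guided)

# Example: a 3-action discrete policy.
cond = np.array([2.0, 0.5, -1.0])     # conditional branch logits
uncond = np.zeros(3)                  # unconditional branch logits
p_base = guided_policy(cond, uncond, gamma=1.0)   # equals softmax(cond)
p_sharp = guided_policy(cond, uncond, gamma=2.0)  # sharper toward the preferred action
```

With `gamma > 1` the probability mass concentrates further on the action favored by the conditional branch, which is the behavioral modulation the abstract refers to; no retraining is involved since only the forward pass changes.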