🤖 AI Summary
Reinforcement learning (RL) lacks flexible, online adjustment of policy behavior at test time without retraining. Method: We propose Policy Gradient Guidance (PGG), the first approach to incorporate classifier-free guidance, originally developed for diffusion models, into classical policy gradient frameworks. PGG augments the policy network with an unconditional branch and enables real-time behavioral modulation via advantage-weighted interpolation, requiring no fine-tuning. Contribution/Results: We show theoretically that, under advantage normalization, the guidance update admits a closed-form, analytically tractable expression. Practically, PGG combines conditional dropout with advantage-based guidance, significantly improving sample efficiency and policy stability across both discrete and continuous control tasks. Extensive experiments on diverse RL benchmarks demonstrate its effectiveness, offering an interpretable, controllable, and training-free pathway to online policy adaptation in RL.
📝 Abstract
We introduce Policy Gradient Guidance (PGG), a simple extension of classifier-free guidance from diffusion models to classical policy gradient methods. PGG augments the policy with an unconditional branch and interpolates between the conditional and unconditional branches, yielding a test-time control knob that modulates behavior without retraining. We provide a theoretical derivation showing that the additional normalization term vanishes under advantage estimation, leading to a clean guided policy gradient update. Empirically, we evaluate PGG on discrete and continuous control benchmarks. We find that conditioning dropout, a component central to diffusion guidance, offers gains in simple discrete tasks and low-sample regimes, but destabilizes continuous control. Training with modestly larger guidance ($\gamma > 1$) consistently improves stability, sample efficiency, and controllability. Our results show that guidance, previously confined to diffusion policies, can be adapted to standard on-policy methods, opening new directions for controllable online reinforcement learning.
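The guidance mechanism described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact parameterization: it assumes a discrete-action policy represented by logits, and applies the classifier-free-guidance-style interpolation `uncond + gamma * (cond - uncond)` before the softmax, so that `gamma = 1` recovers the conditional policy and `gamma > 1` extrapolates away from the unconditional branch. The function and variable names here are illustrative, not from the paper.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def guided_policy(cond_logits, uncond_logits, gamma):
    """Interpolate conditional and unconditional branches in logit space.

    gamma = 1 recovers the conditional policy; gamma > 1 pushes the
    distribution further from the unconditional branch, acting as a
    test-time control knob (hypothetical sketch of the mechanism).
    """
    guided = uncond_logits + gamma * (cond_logits - uncond_logits)
    return softmax(guided)

# Example: a 3-action discrete policy.
cond = np.array([2.0, 0.5, -1.0])     # conditional branch logits
uncond = np.zeros(3)                  # unconditional branch logits
p_base = guided_policy(cond, uncond, gamma=1.0)   # equals softmax(cond)
p_sharp = guided_policy(cond, uncond, gamma=2.0)  # sharper toward the preferred action
```

With `gamma > 1` the probability mass concentrates further on the action favored by the conditional branch, which is the behavioral modulation the abstract refers to; no retraining is involved since only the forward pass changes.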