Self-Distilled Policy Gradient

📅 2026-06-01

🤖 AI Summary
This work addresses the lack of dense supervision in sparse-reward reinforcement learning by proposing SDPG, a self-distilled policy-gradient framework. The method derives dense supervisory signals from the model's own outputs generated under privileged context and optimizes the policy through exact on-policy self-distillation over the full vocabulary. To improve training stability, it combines group-relative advantage estimation with standard-deviation normalization and KL regularization toward a reference policy. Experiments show that SDPG improves both training stability and final performance over RLVR and existing self-distillation baselines.
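As a rough illustration of group-relative advantage estimation with standard-deviation normalization, the sketch below centers each verifier reward on the mean of its sampling group and divides by the group's standard deviation. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for this example; the exact normalization used in SDPG may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Return one advantage per sampled response in a group.

    Assumption: rewards are scalar verifier scores for responses sampled
    from the same prompt. Each reward is centered on the group mean and
    scaled by the group (population) standard deviation; eps guards
    against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, two of which pass the verifier.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero, which is one reason sparse verifier rewards alone can leave the policy without a learning signal.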
📝 Abstract
On-policy self-distillation, where a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. It can be instantiated as an auxiliary full-vocabulary student-to-teacher reverse Kullback-Leibler (KL) divergence loss. Building on this, we propose SDPG, a self-distilled policy-gradient framework that combines group-relative verifier advantages with standard-deviation normalization, exact full-vocabulary on-policy self-distillation, and reference-policy KL regularization. Empirically, SDPG improves stability and performance over RLVR and self-distillation baselines. The code is available at https://github.com/lauyikfung/SDPG.
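To make the full-vocabulary reverse KL term concrete, here is a minimal sketch for a single token position. It assumes the usual distillation convention that "reverse" KL means KL(student || teacher), with the expectation taken under the student distribution and the sum running over every vocabulary entry rather than only the sampled token. The helper names `log_softmax` and `reverse_kl` are hypothetical and are not taken from the SDPG repository.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one position, summed over the vocabulary.

    Assumption: both logit vectors cover the same vocabulary. In a
    self-distillation setup the teacher logits would come from the same
    model conditioned on privileged context.
    """
    log_p = log_softmax(student_logits)  # student log-probabilities
    log_q = log_softmax(teacher_logits)  # teacher log-probabilities
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy two-token vocabulary: student is uniform, teacher prefers token 1.
loss = reverse_kl([0.0, 0.0], [0.0, math.log(3.0)])
```

The value is zero when the two distributions match and positive otherwise. In a training loop this per-position term would typically be averaged over the positions of the student's own generations, which is what makes the distillation on-policy.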
Problem

Research questions and friction points this paper is trying to address.

sparse-reward reinforcement learning
on-policy self-distillation
language model
dense supervision
policy gradient
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-distillation
policy gradient
reverse KL divergence
sparse-reward RL
verifier advantages