AI Summary
This work addresses two limitations of Group Relative Policy Optimization (GRPO): all-or-nothing trajectory-level rewards erase the intra-group relative advantage whenever every sampled trajectory for a prompt is correct or every one is incorrect, and the resulting supervision signal is sparse. It also addresses the difficulty that existing self-distillation approaches have in aligning token-level preferences with trajectory correctness when no reference answers are available. To overcome these challenges, the authors propose CAST, a self-distillation framework for reinforcement learning with verifiable rewards (RLVR) that operates without privileged teachers or ground-truth references. CAST uses a stop-gradient self-teacher to generate token-level advantage signals, and combines bidirectional local advantage sign flipping with a bounded base-advantage mechanism for zero-variance groups, which increases feedback density and improves alignment between token-level signals and trajectory correctness. Experiments show that CAST improves GRPO performance on mathematical reasoning tasks using only trajectory-level signals from a verifier, while remaining lightweight and computationally efficient.
Abstract
Reinforcement learning with verifiable rewards (RLVR), especially Group Relative Policy Optimization (GRPO), has been widely used to improve reasoning in large language models. However, outcome-level rewards provide only sparse supervision, and group-relative advantages vanish when all sampled trajectories for a prompt are either correct or incorrect. On-Policy Self-Distillation (OPSD) offers dense token-level guidance, but its token preferences are not necessarily aligned with trajectory correctness; empirical diagnostics show that OPSD signals behave differently on correct and incorrect rollouts, with teacher-positive and teacher-negative gap signals exhibiting different noise profiles. These diagnostics are conducted under an OPSD-style privileged teacher context for analysis only, whereas CAST training uses answer-free self-teacher scoring.

Motivated by these observations, this work proposes CAST, an answer-free self-distillation method for GRPO-style RLVR. CAST keeps the verifier-grounded GRPO objective, but uses a stop-gradient self-teacher to shape token-level advantages according to trajectory correctness. Unlike prior self-distilled RLVR methods, CAST does not require reference-solution-conditioned teacher scoring, keeps the self-teacher log-probability gap active throughout training, and applies bidirectional local advantage sign reversal: teacher-negative tokens in correct trajectories can receive negative token-level advantages, while teacher-positive tokens in incorrect trajectories can receive bounded positive local advantages. For zero-variance all-correct and all-wrong groups, CAST assigns bounded sign-constrained base advantages, so these otherwise zero-gradient groups can contribute verifier-signed token feedback. Experiments on mathematical reasoning show that CAST improves RLVR training while retaining a lightweight, verifier-grounded trajectory-level objective.
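The vanishing-advantage problem and the kind of token-level shaping described above can be illustrated with a small sketch. The first function is the usual group-normalized advantage used in GRPO-style training. The second is only a hypothetical shaping rule written to match the abstract's description: the function name, the parameters `base`, `flip_scale`, and `bound`, and the exact formula are assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def shaped_token_advantages(traj_adv, gaps, correct,
                            base=0.5, flip_scale=0.2, bound=1.0):
    """Hypothetical token-level shaping (assumed rule, not the paper's formula).

    gaps[t] is the self-teacher log-prob minus the policy log-prob for token t,
    treated as a constant (the stop-gradient in the abstract).
    """
    sign = 1.0 if correct else -1.0
    # Zero-variance group: use a bounded base advantage with the verifier's sign.
    adv = traj_adv if traj_adv != 0.0 else sign * base
    shaped = []
    for g in gaps:
        # A token "disagrees" when the teacher's preference opposes the outcome.
        disagrees = (g < 0) if correct else (g > 0)
        if disagrees:
            # Local sign reversal, with magnitude capped at flip_scale * bound.
            shaped.append(-sign * flip_scale * min(abs(g), bound))
        else:
            shaped.append(adv)
    return shaped


# An all-correct group gives every rollout a zero advantage under plain GRPO.
print(group_relative_advantages([1, 1, 1, 1]))  # [0.0, 0.0, 0.0, 0.0]
# With shaping, the same rollout still yields signed token-level feedback.
print(shaped_token_advantages(0.0, [0.3, -2.0, 0.1], correct=True))
```

In this sketch, a teacher-negative token in a correct trajectory receives a small negative advantage and a teacher-positive token in an incorrect trajectory receives a small positive one, while all other tokens keep the verifier-signed trajectory advantage. How CAST actually weights and bounds these terms is not specified in the abstract.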