🤖 AI Summary
This work addresses entropy collapse in reinforcement learning from verifiable rewards, where an increasingly concentrated policy reduces rollout diversity and weakens the learning signal. The authors propose Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), which internalizes the exploratory effect of temperature into the model parameters. Starting from an entropy-collapsed RL checkpoint, TS-OPSD builds a smoothed self-teacher distribution by applying high-temperature scaling to the model's own logits and distills it back into the student policy, reheating the policy without an external teacher, privileged data, or additional inference cost. On Qwen3-4B-Base and Qwen3-8B-Base, the reheated policy is a stronger initialization for continued RL than both standard continued RL and rollout-level temperature reheating. Analyses indicate that TS-OPSD mainly reduces output sharpness while preserving intermediate representations, top candidate sets, and reasoning capability.
📝 Abstract
Reinforcement learning from verifiable rewards improves the reasoning ability of large language models, but often suffers from entropy collapse, in which increasingly concentrated policies reduce rollout diversity and useful learning signals. Existing remedies either constrain the RL objective (e.g., entropy regularization) or adjust sampling temperature during rollout collection, but these interventions remain external to the model parameters. We propose Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), a lightweight policy reheating method that internalizes the exploratory effect of temperature into model parameters. Starting from an entropy-collapsed RL checkpoint, TS-OPSD constructs a self-teacher by applying high-temperature scaling to the model's own logits, then distills the resulting smoother distribution back into the student. This policy reheating requires no external teacher, privileged data, or additional inference cost. Experiments on Qwen3-4B-Base and Qwen3-8B-Base show that policy reheating yields a stronger initialization for continued RL than both standard continued RL and rollout-level temperature reheating. Further analyses show that TS-OPSD mainly reduces output sharpness while preserving intermediate representations, top candidate sets, and reasoning capability. These results suggest that entropy restoration can serve as a simple post-collapse intervention for extending reasoning-oriented RL.
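The self-teacher construction described above can be illustrated with a small numerical sketch. It is not the authors' implementation: the logits, the teacher temperature of 2.0, and the choice of KL(teacher || student) as the distillation loss are all assumptions made for illustration, and the abstract does not specify any of them. The sketch only shows that dividing the model's own logits by a temperature above 1 gives a higher-entropy target with the same token ranking.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical next-token logits from a sharply peaked (collapsed) policy.
logits = [8.0, 2.0, 1.0, 0.5]
TEACHER_TEMPERATURE = 2.0  # assumed value, not taken from the paper

student = softmax(logits)                       # current policy, T = 1
teacher = softmax(logits, TEACHER_TEMPERATURE)  # self-teacher, T > 1

# Assumed distillation objective: pull the student toward the smoother teacher.
distill_loss = kl_divergence(teacher, student)

print(f"student entropy: {entropy(student):.4f}")
print(f"teacher entropy: {entropy(teacher):.4f}")
print(f"KL(teacher || student): {distill_loss:.4f}")
```

Because temperature scaling divides every logit by the same positive constant, the ordering of tokens by probability is unchanged. This is consistent with the reported finding that TS-OPSD reduces output sharpness while preserving top candidate sets.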