🤖 AI Summary
This work addresses the instability and optimization failures that arise in on-policy distillation of large language models when the teacher and student distributions differ substantially. To mitigate this issue, the authors propose Trust-Region On-Policy Distillation (TrOPD), which restricts on-policy distillation to regions where the teacher provides reliable supervision. The method combines three components: trust-region on-policy learning; handling of outlier regions through masking, gradient clipping, or forward-KL estimation; and off-policy guidance, in which the student continues generation from teacher prefixes and imitates the teacher via forward KL. These techniques are intended to make token-level supervision more robust and training more stable. Empirical evaluations show that the method consistently outperforms state-of-the-art on-policy distillation baselines on mathematical reasoning, code generation, and general-domain benchmarks.
📝 Abstract
On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially, because teacher supervision on student-generated tokens may yield unreliable policy gradients and even cause optimization failure. This work addresses reliable on-policy token-level supervision through credit assignment strategies and proposes Trust-Region On-Policy Distillation (TrOPD). It has three main features: 1) Trust-Region On-Policy Learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch. 2) Outlier Estimation: For outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision. 3) Off-Policy Guidance: The student continues generation from teacher prefixes and uses forward KL to imitate off-policy guidance, encouraging on-policy exploration toward reliable regions. Experiments show that TrOPD consistently outperforms state-of-the-art OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.
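To make the first two features of the abstract more concrete, the following is a minimal NumPy sketch of per-token trust-region gating. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not say how the trust region is defined, so the sketch assumes a hypothetical criterion: a student-sampled token is inside the region when the absolute log-ratio between student and teacher is at most a threshold `eps`. The function name `trust_region_signals`, the parameter `eps`, and the `outlier_mode` options are all invented for this example. Inside the region the sketch uses the standard K1 estimator of reverse KL, log π_student(x) − log π_teacher(x) for x sampled from the student; outside it, the token is either masked or scored with the exact forward KL over the vocabulary. Gradient clipping, the third outlier option named in the abstract, is not shown.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def trust_region_signals(student_logits, teacher_logits, sampled_ids,
                         eps=2.0, outlier_mode="mask"):
    """Per-token distillation signals with a hypothetical trust region.

    student_logits, teacher_logits: arrays of shape [T, V].
    sampled_ids: length-T integer array of tokens sampled from the student.
    eps: ASSUMED trust-region threshold on |log-ratio| (not from the paper).
    outlier_mode: "mask" zeroes outlier tokens; "forward_kl" replaces them
        with KL(teacher || student) computed over the full vocabulary.
    Returns (signals, in_region), both of shape [T].
    """
    s_logp = log_softmax(np.asarray(student_logits, dtype=float))
    t_logp = log_softmax(np.asarray(teacher_logits, dtype=float))
    sampled_ids = np.asarray(sampled_ids)
    positions = np.arange(sampled_ids.shape[0])

    # K1 estimator of reverse KL(student || teacher) at each sampled token.
    k1 = s_logp[positions, sampled_ids] - t_logp[positions, sampled_ids]

    # ASSUMPTION: a token is "reliable" when the log-ratio is small.
    in_region = np.abs(k1) <= eps

    if outlier_mode == "mask":
        outlier = np.zeros_like(k1)
    elif outlier_mode == "forward_kl":
        # Exact forward KL(teacher || student) summed over the vocabulary.
        outlier = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1)
    else:
        raise ValueError(f"unknown outlier_mode: {outlier_mode!r}")

    return np.where(in_region, k1, outlier), in_region
```

In an actual training loop these values would feed a policy-gradient or loss computation in an autodiff framework; here they are plain numbers so that the gating logic can be inspected in isolation. The choice of `eps` and of the gating statistic would need to follow the paper's own definition of a reliable region.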