🤖 AI Summary
This work addresses gradient instability in on-policy distillation for multimodal large language models, which arises in outlier states. To mitigate it, the authors propose Globally Normalized Distillation Policy Optimization (GNDPO), which applies batch-level global normalization to convert token-level KL divergence supervision signals into relative advantages. This mitigates gradient explosion while preserving fine-grained, token-level guidance. By combining on-policy distillation with policy optimization, GNDPO substantially improves training stability and downstream performance across multimodal reasoning tasks.
📝 Abstract
On-policy distillation (OPD) has recently emerged as an important post-training paradigm. By using a stronger teacher model to provide dense, fine-grained supervision for sampled trajectories, OPD offers a clear advantage over reinforcement learning with verifiable rewards (RLVR), which typically depends on sparse binary or outcome-based environmental feedback. However, naive token-level distillation can suffer from gradient instability caused by magnitude misalignment in outlier states. To address this issue, we propose Globally Normalized Distillation Policy Optimization (GNDPO), a practical method that stabilizes optimization by transforming raw KL scores into batch-level relative advantages. This normalization mitigates gradient explosions while retaining the benefits of token-level guidance. Experimental results show that GNDPO substantially improves training robustness and downstream performance across multimodal reasoning tasks. The code is released at https://github.com/OPPO-Mente-Lab/GNDPO.
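To make the normalization idea concrete, the following is a minimal sketch of how raw token-level KL scores might be turned into batch-level relative advantages. It is not taken from the released code. The function name `globally_normalized_advantages`, the use of mean and standard deviation standardization over all tokens in the batch, and the sign convention (a larger KL score yields a lower advantage) are all assumptions made for illustration.

```python
import math


def globally_normalized_advantages(token_kl, eps=1e-8):
    """Standardize token-level KL scores over the whole batch.

    token_kl: list of sequences, each a list of per-token KL scores
              (sequences may have different lengths).
    Returns a list with the same shape holding relative advantages.

    Assumptions (not confirmed by the abstract): z-score normalization,
    and advantage = negative standardized KL.
    """
    # Pool every token in the batch so the statistics are global,
    # not per-sequence.
    flat = [kl for seq in token_kl for kl in seq]
    if not flat:
        return [[] for _ in token_kl]

    mean = sum(flat) / len(flat)
    var = sum((kl - mean) ** 2 for kl in flat) / len(flat)
    std = math.sqrt(var)

    # Higher KL (student far from teacher) -> lower advantage (assumed sign).
    return [[-(kl - mean) / (std + eps) for kl in seq] for seq in token_kl]


# Hypothetical batch: the second sequence contains one outlier token.
batch = [[0.1, 0.2, 0.1], [0.3, 50.0]]
advantages = globally_normalized_advantages(batch)
```

Under these assumptions, the outlier token's raw score of 50.0 no longer dominates by its absolute magnitude: after standardization every advantage is bounded relative to the batch spread, while the per-token ordering of the scores is preserved.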