Stabilizing On-Policy Distillation for MLLM Reasoning with Global Normalization

📅 2026-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses gradient instability in on-policy distillation for multimodal large language models, which arises from magnitude misalignment in outlier states. To mitigate this issue, the authors propose Globally Normalized Distillation Policy Optimization (GNDPO), which applies batch-level global normalization to convert raw token-level KL scores into relative advantages. This mitigates gradient explosion while preserving fine-grained, token-level guidance. By integrating on-policy distillation with policy optimization, GNDPO is reported to substantially improve training robustness and downstream performance across multimodal reasoning tasks.
📝 Abstract
On-policy distillation (OPD) has recently emerged as an important post-training paradigm. By using a stronger teacher model to provide dense, fine-grained supervision for sampled trajectories, OPD offers a clear advantage over reinforcement learning with verifiable rewards (RLVR), which typically depends on sparse binary or outcome-based environmental feedback. However, naive token-level distillation can suffer from gradient instability due to magnitude misalignment in outlier states. To address this issue, we propose Globally Normalized Distillation Policy Optimization (GNDPO), a practical method that stabilizes optimization by transforming raw KL scores into batch-level relative advantages. This normalization effectively mitigates gradient explosions while retaining the benefits of token-level guidance. Experimental results show that GNDPO substantially improves training robustness and downstream performance across multimodal reasoning tasks. The code is released at https://github.com/OPPO-Mente-Lab/GNDPO.
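The normalization step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact formula, so the code assumes that the raw per-token score is the teacher log-probability minus the student log-probability of the sampled token (a single-sample estimate of the negative reverse KL), and that "global" means standardizing with the mean and standard deviation over every non-padding token in the batch. The function name `global_normalized_advantages` and the toy values are hypothetical.

```python
import numpy as np


def global_normalized_advantages(teacher_logp, student_logp, mask, eps=1e-6):
    """Turn raw per-token scores into batch-level relative advantages.

    Assumption (not from the paper): score = teacher_logp - student_logp
    for each sampled token, standardized over all valid tokens in the batch.
    """
    teacher_logp = np.asarray(teacher_logp, dtype=np.float64)
    student_logp = np.asarray(student_logp, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)

    # Raw token-level signal; an outlier state can make this very large.
    raw = teacher_logp - student_logp

    # Statistics are pooled over the whole batch, not per sequence.
    valid = raw[mask]
    mean, std = valid.mean(), valid.std()

    adv = (raw - mean) / (std + eps)
    # Padding positions contribute nothing to the loss.
    return np.where(mask, adv, 0.0)


# Toy batch: 2 sequences, 4 positions, last position of sequence 0 is padding.
student_logp = np.array([[-0.5, -1.0, -0.2, 0.0],
                         [-0.7, -0.3, -0.9, -0.4]])
teacher_logp = np.array([[-0.6, -0.8, -40.0, 0.0],   # one outlier token
                         [-0.5, -0.4, -1.0, -0.3]])
mask = np.array([[1, 1, 1, 0],
                 [1, 1, 1, 1]], dtype=bool)

adv = global_normalized_advantages(teacher_logp, student_logp, mask)
print(np.round(adv, 3))
```

In this toy batch the outlier token has a raw score near -40, but after standardization its magnitude is bounded by the batch statistics, so it can no longer dominate a policy-gradient-style update that weights each token's log-probability by its advantage. Every token still receives its own signal, which is the token-level guidance the abstract says is retained.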
Problem

Research questions and friction points this paper is trying to address.

On-policy distillation
Gradient instability
Multimodal reasoning
Token-level distillation
Outlier states
Innovation

Methods, ideas, or system contributions that make the work stand out.

On-policy distillation
Global normalization
Gradient stability
Multimodal reasoning
KL divergence