Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

📅 2026-06-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines how dense teacher supervision in on-policy distillation (OPD) shapes the sparsity and geometry of the student model's parameter updates. OPD combines on-policy student trajectories with dense supervision signals; the analysis uses singular value decomposition, an optimizer ablation, and subnetwork training. Although OPD updates are numerically full-rank, they are small, coordinate-sparse, and spectrally concentrated: they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero, which preserves key geometric signatures of on-policy post-training. Training only the discovered sparse subnetwork recovers nearly the same performance as full OPD. In the optimizer ablation, AdamW outperforms SGD, likely because dense teacher supervision preserves heterogeneous coordinate-wise gradient scales that AdamW's adaptive scaling can exploit.
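To make "coordinate sparsity" and "subnetwork training" concrete, the following is a minimal NumPy sketch, not the paper's implementation. The function names, the tolerance `tol`, and the toy 8x8 weight matrix are all illustrative assumptions; the paper's exact sparsity criterion is not stated in this summary. The masked step uses a plain gradient update purely for brevity, even though the summary reports that AdamW outperforms SGD.

```python
import numpy as np

def update_sparsity(w_src, w_new, tol=1e-5):
    """Fraction of coordinates whose absolute change is at most `tol` (assumed criterion)."""
    delta = w_new - w_src
    return float(np.mean(np.abs(delta) <= tol))

def update_mask(w_src, w_new, tol=1e-5):
    """Boolean mask of coordinates that actually moved; this defines the 'subnetwork'."""
    return np.abs(w_new - w_src) > tol

def masked_step(w, grad, mask, lr=1e-2):
    """One plain gradient step that only touches coordinates inside `mask`."""
    return w - lr * grad * mask

# Toy example: a hypothetical 8x8 weight matrix where only 4 of 64 entries change.
rng = np.random.default_rng(0)
w_src = rng.normal(size=(8, 8))
w_new = w_src.copy()
w_new[0, :4] += 0.1

sparsity = update_sparsity(w_src, w_new)  # 60 of 64 coordinates are unchanged
mask = update_mask(w_src, w_new)
w_next = masked_step(w_src, np.ones_like(w_src), mask)
```

In this sketch, the mask would be extracted from a completed full run and then reused to freeze every coordinate outside it in a second run, which is one plausible reading of "training only the discovered subnetwork".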

📝 Abstract

On-policy distillation (OPD) has recently become a prominent post-training recipe as it combines two desirable ingredients: on-policy student trajectories and dense teacher supervision, yet how this hybrid changes a model's parameters remains unclear. Across several language and vision-language model pairs and use cases, our analysis yields two main findings. On sparsity, OPD-style updates are small and coordinate-sparse. They are distributed across layers and are usually FFN-heavy. This sparse structure is operationally useful: training only the discovered subnetwork recovers nearly the same performance as full OPD. However, the sparsity-inducing SGD optimizer underperforms AdamW in our optimizer ablation, likely because dense teacher supervision preserves heterogeneous coordinate-wise gradient scales where AdamW's adaptive scaling remains useful. On geometry, the updates are numerically full-rank but spectrally concentrated; they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. These findings suggest that dense teacher supervision does not turn OPD into ordinary dense parameter rewriting; instead, OPD retains important geometric signatures of on-policy post-training.
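The geometric claims above can be illustrated with a short NumPy sketch. This is an assumption-laden illustration rather than the authors' analysis code: the function names, the relative tolerance `rel_tol`, the choice of `k`, and the diagonal toy matrices are all hypothetical. It computes a numerical rank, the share of update energy in the top-k singular values (spectral concentration), and the fraction of update energy that falls inside the top-k singular subspaces of the source weights.

```python
import numpy as np

def numerical_rank(delta, rel_tol=1e-6):
    """Count singular values above rel_tol times the largest one."""
    s = np.linalg.svd(delta, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

def spectral_concentration(delta, k):
    """Share of the squared Frobenius norm carried by the top-k singular values."""
    s = np.linalg.svd(delta, compute_uv=False)
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

def principal_subspace_energy(w_src, delta, k):
    """Fraction of the update's energy inside the top-k left and right
    singular subspaces of the source weights."""
    u, _, vt = np.linalg.svd(w_src, full_matrices=False)
    uk, vk = u[:, :k], vt[:k, :].T
    projected = uk @ (uk.T @ delta @ vk) @ vk.T
    return float(np.linalg.norm(projected) ** 2 / np.linalg.norm(delta) ** 2)

# Toy example: the update is full-rank, but almost all of its energy sits in one
# direction that is not among the top-2 singular directions of the source weights.
w_src = np.diag([10.0, 5.0, 1.0, 0.1])
delta = np.diag([0.001, 0.001, 0.3, 0.01])

rank = numerical_rank(delta)                           # every singular value is nonzero
conc = spectral_concentration(delta, k=1)              # close to 1
overlap = principal_subspace_energy(w_src, delta, k=2) # close to 0
```

A full-rank update with `conc` near 1 and `overlap` near 0 matches the pattern the abstract describes: numerically full-rank, spectrally concentrated, and mostly away from the principal singular subspaces of the source weights.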

Problem

Research questions and friction points this paper is trying to address.

On-policy Distillation

Sparsity

Geometry

Dense Supervision

Parameter Updates

Innovation

Methods, ideas, or system contributions that make the work stand out.

On-policy Distillation

Coordinate Sparsity

Spectral Concentration