OPRD: On-Policy Representation Distillation

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of conventional on-policy distillation (OPD), which matches only the teacher model's next-token probabilities in output space: Monte Carlo KL estimates over large vocabularies have high variance, and the teacher's intermediate hidden representations are ignored. The authors propose moving distillation from output space to hidden-state space, aligning student and teacher representations at selected layers on the same rollouts and bypassing the language model head entirely. The approach is presented as the first to bring trajectory-based hidden-layer alignment into policy distillation, removing sampling variance and transferring richer per-layer structural information. On the AIME 2024/2025 and AIMO benchmarks, OPRD closes the student-teacher gap while output-space OPD baselines plateau below the teacher; compared with top-k OPD, it trains 1.44× faster and uses 54% less memory.
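To make the sampling-variance issue concrete, here is a minimal sketch in plain Python that compares an exact KL divergence with single-sample Monte Carlo estimates of it. Everything in it is an illustrative assumption rather than the paper's setup: the toy vocabulary of 1,000 tokens, the random logits, the choice of KL(student || teacher) with tokens drawn from the student, and the helper names `softmax`, `exact_kl`, and `mc_kl`.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def exact_kl(p, q):
    """Exact KL(p || q), computed as a full sum over the vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mc_kl(p, q, n_samples, rng):
    """Monte Carlo estimate of KL(p || q) from tokens sampled from p."""
    tokens = rng.choices(range(len(p)), weights=p, k=n_samples)
    return sum(math.log(p[t] / q[t]) for t in tokens) / n_samples


rng = random.Random(0)
vocab_size = 1000  # toy size; real vocabularies are far larger

# Hypothetical student and teacher next-token distributions.
student = softmax([rng.gauss(0.0, 1.0) for _ in range(vocab_size)])
teacher = softmax([rng.gauss(0.0, 1.0) for _ in range(vocab_size)])

kl = exact_kl(student, teacher)

# Repeat the one-sample estimator many times to measure its spread.
estimates = [mc_kl(student, teacher, 1, rng) for _ in range(2000)]
mean_est = sum(estimates) / len(estimates)
var_est = sum((e - mean_est) ** 2 for e in estimates) / len(estimates)

print(f"exact KL:            {kl:.4f}")
print(f"mean of MC estimates: {mean_est:.4f}")
print(f"variance of one-sample estimate: {var_est:.4f}")
```

The estimator is unbiased, so its average lands near the exact value, but each individual estimate is noisy. That per-step noise is the variance the summary refers to.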

📝 Abstract

On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) it treats the teacher as a black box, discarding all intermediate hidden states after the LM head. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, while output-space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.
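The idea of aligning hidden states across selected layers on the same rollout can be sketched as follows. This is not the authors' implementation: the abstract does not specify the loss, the layer pairing, or how differing hidden sizes are handled. The sketch assumes a mean-squared-error objective, a hand-picked `layer_map` of (student layer, teacher layer) pairs, and a linear projection per student layer. All function and variable names are hypothetical.

```python
def project(h, W):
    """Map a student hidden vector into teacher space with matrix W.

    W has one row per teacher dimension and one column per student dimension.
    """
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def layer_alignment_loss(student_h, teacher_h, W):
    """MSE between projected student states and teacher states for one layer.

    student_h and teacher_h hold one hidden vector per token of the same
    rollout, so position i in both lists refers to the same token.
    """
    total, count = 0.0, 0
    for hs, ht in zip(student_h, teacher_h):
        proj = project(hs, W)
        total += sum((a - b) ** 2 for a, b in zip(proj, ht))
        count += len(ht)
    return total / count


def representation_loss(student_states, teacher_states, layer_map, projections):
    """Average the per-layer alignment loss over the selected layer pairs."""
    losses = [
        layer_alignment_loss(student_states[s], teacher_states[t], projections[s])
        for s, t in layer_map
    ]
    return sum(losses) / len(losses)


# Toy rollout of two tokens: student hidden size 2, teacher hidden size 3.
student_states = {1: [[1.0, 2.0], [0.0, 1.0]]}
projections = {1: [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]}
layer_map = [(1, 2)]  # assumed pairing: student layer 1 with teacher layer 2

# A teacher whose states equal the projected student states gives zero loss.
matched_teacher = {2: [[1.0, 2.0, 3.0], [0.0, 1.0, 1.0]]}
loss_matched = representation_loss(
    student_states, matched_teacher, layer_map, projections
)

# Shifting one teacher coordinate by 1 gives a squared error of 1 over 6 entries.
shifted_teacher = {2: [[1.0, 2.0, 4.0], [0.0, 1.0, 1.0]]}
loss_shifted = representation_loss(
    student_states, shifted_teacher, layer_map, projections
)

print(loss_matched, loss_shifted)
```

Unlike a sampled KL estimate, this loss is a deterministic function of the hidden states on a given rollout. No token is sampled from the vocabulary when computing it and the LM head is never evaluated.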

Problem

Research questions and friction points this paper is trying to address.

on-policy distillation

sampling variance

hidden states

large vocabulary

knowledge distillation

Innovation

Methods, ideas, or system contributions that make the work stand out.

On-Policy Distillation

Representation Distillation

Hidden-State Alignment