A Survey of On-Policy Distillation for Large Language Models

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work systematically surveys on-policy distillation (OPD) for large language models, which addresses the exposure bias arising from the train-test mismatch in conventional off-policy knowledge distillation. It introduces a unified f-divergence theoretical framework that organizes existing techniques along three orthogonal dimensions: feedback signal, teacher access mode, and loss granularity, covering white-box, black-box, and teacher-free settings as well as token-level and sequence-level losses. The survey also reveals an intrinsic connection between OPD and interactive imitation learning, reviews representative methods and industrial practices, and identifies key open challenges such as distillation scaling laws and uncertainty-aware feedback, providing a clear technical roadmap for future research.
📝 Abstract
Knowledge distillation has become a primary mechanism for transferring reasoning and domain expertise from frontier Large Language Models (LLMs) to smaller, deployable students. However, the dominant paradigm remains off-policy: students train on static teacher-generated data and never encounter their own errors during learning. This train-test mismatch, an instance of exposure bias, causes prediction errors to compound autoregressively at inference time. On-Policy Distillation (OPD) addresses this by letting the student generate its own trajectories and receive teacher feedback on these self-generated outputs, grounding distillation in the theory of interactive imitation learning. Despite rapid growth spanning divergence minimization, reward-guided learning, and self-play, the OPD literature remains fragmented with no unified treatment. This survey provides the first comprehensive overview of OPD for LLMs. We introduce a unified f-divergence framework over on-policy samples and organize the landscape along three orthogonal dimensions: feedback signal (logit-based, outcome-based, or self-play), teacher access (white-box, black-box, or teacher-free), and loss granularity (token-level, sequence-level, or hybrid). We systematically analyze representative methods, examine industrial deployments, and identify open problems including distillation scaling laws, uncertainty-aware feedback, and agent-level distillation.
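The on-policy recipe the abstract describes can be sketched in a few lines: the student samples its own trajectory, and at each self-generated prefix a token-level divergence to the teacher is accumulated. The sketch below uses reverse KL, one common white-box choice; all function names and the toy logit functions are illustrative assumptions, not the paper's implementation.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reverse_kl(student_logits, teacher_logits):
    # Token-level D_KL(student || teacher): mode-seeking feedback signal.
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

def on_policy_distill_loss(student_logit_fn, teacher_logit_fn, sample_fn, seq_len):
    """One OPD step (sketch): the student generates its own trajectory,
    and the loss is the mean token-level reverse KL to the teacher,
    evaluated on those self-generated prefixes (not on teacher data)."""
    prefix, total = [], 0.0
    for _ in range(seq_len):
        s_logits = student_logit_fn(prefix)   # white-box access to student
        t_logits = teacher_logit_fn(prefix)   # white-box access to teacher
        total += reverse_kl(s_logits, t_logits)
        # On-policy: the next token comes from the student's own distribution.
        prefix.append(sample_fn(softmax(s_logits)))
    return total / seq_len

# Toy usage: two-token vocabulary, prefix-independent logits, greedy sampling.
student = lambda prefix: [2.0, 0.0]
teacher = lambda prefix: [0.0, 2.0]
greedy = lambda probs: probs.index(max(probs))
loss = on_policy_distill_loss(student, teacher, greedy, seq_len=4)
```

In a real training loop the loss would be differentiated through the student's logits; this sketch only shows where the on-policy sampling and the teacher feedback enter the objective.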
Problem

Research questions and friction points this paper is trying to address.

On-Policy Distillation
Exposure Bias
Large Language Models
Knowledge Distillation
Imitation Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

On-Policy Distillation
f-divergence framework
exposure bias
interactive imitation learning
LLM compression
Mingyang Song
Tencent Inc.
NLP · IR · LLMs

Mao Zheng
Large Language Model Department, Tencent, China