Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation

📅 2026-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses supervision fidelity decay in on-policy distillation: as the prefixes generated by the student model grow longer, the teacher's next-token distribution becomes less confident and less discriminative, which weakens the corrective signal of reverse-KL distillation and lets student drift compound across long reasoning chains. The paper identifies this issue and introduces Lookahead Group Reward, which scores each of the student's top-K candidate tokens by the teacher confidence it induces at the next step and assigns a group-normalized reward. To keep computation efficient, an entropy-triggered tree-attention mechanism is also devised. In experiments, a 7B student model improves mean@8 by an average of 2.57 points over on-policy distillation across six math and code benchmarks, with larger gains at longer generation lengths, reaching 4.92 points on AIME-26 (39k tokens).
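As a rough illustration of why a less confident teacher yields a weaker corrective signal, the toy sketch below compares the per-token reverse KL, KL(student || teacher), against a peaked and a near-flat teacher distribution. The three-token vocabulary, the probability values, and the helper names `reverse_kl` and `entropy` are illustrative assumptions, not taken from the paper.

```python
import math

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """Reverse KL at one token position: KL(student || teacher)."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )

def entropy(probs, eps=1e-12):
    """Shannon entropy in nats; higher means a less confident distribution."""
    return -sum(p * math.log(p + eps) for p in probs if p > 0.0)

# Hypothetical distributions over a 3-token vocabulary.
student = [0.10, 0.80, 0.10]            # student prefers token 1
confident_teacher = [0.90, 0.05, 0.05]  # teacher clearly prefers token 0
flat_teacher = [0.34, 0.33, 0.33]       # teacher is nearly indifferent

kl_confident = reverse_kl(student, confident_teacher)
kl_flat = reverse_kl(student, flat_teacher)

# The same student mistake is penalized far less when the teacher is flat.
print(f"reverse KL vs confident teacher: {kl_confident:.3f}")
print(f"reverse KL vs flat teacher:      {kl_flat:.3f}")
```

In this toy setting the student's distribution is identical in both cases; only the teacher's confidence changes, and the penalty shrinks with it.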
📝 Abstract
On-policy distillation transfers reasoning capabilities by training a student model on its own generated trajectories using token-level feedback from a teacher. However, we identify a critical bottleneck, Supervision Fidelity Decay (SFD): as student-generated prefixes lengthen, the teacher's next-token distribution becomes less confident and less discriminative. Consequently, the teacher-dependent corrective signal in reverse-KL distillation weakens, causing student drift to compound across long reasoning chains. To mitigate SFD, we introduce Lookahead Group Reward. Building on the insight that next-step teacher confidence reflects the discriminative strength of future reverse-KL supervision, Lookahead Group Reward evaluates the student's top-K candidate tokens by the teacher confidence they induce at the subsequent step and assigns a group-normalized reward. To maintain computational efficiency, we further design an entropy-triggered tree-attention mechanism. Across six math and code benchmarks, Lookahead Group Reward improves mean@8 by 2.57 points over on-policy distillation (OPD) for a 7B student, with gains increasing with generation length and reaching +4.92 points on AIME-26 at 39k tokens.
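The group-normalized reward described above can be sketched as follows. The abstract does not say how teacher confidence is measured or how the group is normalized, so this sketch assumes the teacher's maximum next-token probability as the confidence proxy and a z-score within the top-K group as the normalization. The function names and example distributions are hypothetical.

```python
import math

def max_prob_confidence(teacher_next_probs):
    """Assumed confidence proxy: the teacher's largest next-token probability."""
    return max(teacher_next_probs)

def lookahead_group_rewards(candidate_next_dists, eps=1e-8):
    """Score each top-K candidate by the teacher confidence it induces at the
    next step, then z-score the scores within the group (assumed scheme)."""
    conf = [max_prob_confidence(d) for d in candidate_next_dists]
    mean = sum(conf) / len(conf)
    var = sum((c - mean) ** 2 for c in conf) / len(conf)
    std = math.sqrt(var)
    return [(c - mean) / (std + eps) for c in conf]

# Hypothetical teacher next-step distributions, one per student candidate
# (K = 3): each row is what the teacher predicts after that candidate is
# appended to the current prefix.
candidate_next_dists = [
    [0.85, 0.10, 0.05],  # candidate 0 leaves the teacher confident
    [0.50, 0.30, 0.20],  # candidate 1 leaves the teacher less sure
    [0.40, 0.35, 0.25],  # candidate 2 leaves the teacher nearly flat
]

rewards = lookahead_group_rewards(candidate_next_dists)
for k, r in enumerate(rewards):
    print(f"candidate {k}: reward {r:+.3f}")
```

Because the rewards are centered within the group, they sum to approximately zero, and the candidate that keeps the teacher most confident receives the largest reward. Negative entropy would be another plausible confidence proxy under the same structure.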
Problem

Research questions and friction points this paper is trying to address.

Supervision Fidelity Decay
On-Policy Distillation
Student Drift
Teacher Confidence
Long-Chain Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervision Fidelity Decay
Lookahead Group Reward
On-Policy Distillation
Reverse-KL Distillation
Tree Attention