Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses supervision fidelity decay (SFD) in on-policy distillation: as the student-generated prefix grows longer, the teacher's next-token distribution becomes less confident and less discriminative, which weakens the corrective signal of reverse-KL distillation and lets student drift compound over long reasoning chains. The paper identifies and characterizes this issue and introduces the Lookahead Group Reward mechanism, which scores the student's top-K candidate tokens by the teacher confidence each one induces at the next step and assigns group-normalized rewards accordingly. To keep the lookahead computationally efficient, an entropy-triggered tree-attention mechanism is also devised. In experiments, a 7B student model improves mean@8 by an average of 2.57 points over standard on-policy distillation across six math and code benchmarks, with gains growing in longer-generation settings and reaching 4.92 points on AIME-26 (39k tokens).

📝 Abstract

On-policy distillation (OPD) transfers reasoning capabilities by training a student model on its own generated trajectories using token-level feedback from a teacher. However, we identify a critical bottleneck, Supervision Fidelity Decay (SFD): as student-generated prefixes lengthen, the teacher's next-token distribution becomes less confident and less discriminative. Consequently, the teacher-dependent corrective signal in reverse-KL distillation weakens, causing student drift to compound across long reasoning chains. To mitigate SFD, we introduce Lookahead Group Reward. Building on the insight that next-step teacher confidence reflects the discriminative strength of future reverse-KL supervision, Lookahead Group Reward evaluates the student's top-K candidate tokens by the teacher confidence they induce at the subsequent step and assigns a group-normalized reward. To maintain computational efficiency, we further design an entropy-triggered tree-attention mechanism. Across six math and code benchmarks, Lookahead Group Reward improves mean@8 by 2.57 points over OPD for a 7B student, with gains increasing in longer-generation settings and reaching +4.92 points on AIME-26 at 39k tokens.
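
The scoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how teacher confidence is measured, so the sketch assumes negative entropy of the teacher's next-step distribution as a proxy, and it assumes "group-normalized" means subtracting the group mean and dividing by the group standard deviation. All function names are hypothetical, and the entropy-triggered tree-attention component is not modeled.

```python
import math
from statistics import mean, pstdev


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def teacher_confidence(next_step_probs):
    """ASSUMPTION: confidence is the negative entropy of the teacher's
    next-step distribution (higher means more peaked). The abstract does
    not specify the exact measure."""
    return -entropy(next_step_probs)


def group_normalized_rewards(confidences, eps=1e-8):
    """ASSUMPTION: normalize within the candidate group by subtracting the
    mean and dividing by the population standard deviation."""
    mu = mean(confidences)
    sigma = pstdev(confidences)
    return [(c - mu) / (sigma + eps) for c in confidences]


def lookahead_group_reward(teacher_next_dists):
    """teacher_next_dists[k] is the teacher's distribution at the step after
    the student's k-th candidate token has been appended to the prefix."""
    confidences = [teacher_confidence(d) for d in teacher_next_dists]
    return group_normalized_rewards(confidences)


# Toy group of K = 3 candidates: the first leaves the teacher confident,
# the last leaves it close to uniform.
toy_dists = [
    [0.90, 0.05, 0.05],
    [0.50, 0.30, 0.20],
    [0.34, 0.33, 0.33],
]
rewards = lookahead_group_reward(toy_dists)
```

In this toy group, the candidate that leaves the teacher most confident receives the largest reward and the rewards sum to approximately zero, so the signal only ranks candidates relative to one another. In a real training loop the distributions would come from a teacher forward pass over each candidate continuation.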

Problem

Research questions and friction points this paper is trying to address.

Supervision Fidelity Decay

On-Policy Distillation

Student Drift

Teacher Confidence

Long-Chain Reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervision Fidelity Decay

Lookahead Group Reward

On-Policy Distillation