Escaping the KL Agreement Trap in On-Policy Distillation

📅 2026-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies and formalizes a failure mode in on-policy distillation (OPD), the low-KL agreement trap: once the student drifts into a degraded, unrecoverable prefix, the teacher locally agrees with the student's continuation, so the reverse KL divergence is deceptively low and carries little corrective signal. To mitigate this, the authors propose KAT (KL Agreement Trap Termination), an online rule that monitors reverse KL during rollouts and uses a dynamic, training-adaptive threshold to terminate trajectories stuck in persistent low-KL agreement, filtering out weak supervision. Across four mathematical reasoning benchmarks, KAT improves avg@k accuracy by 2.66% and pass@k by 3.43%, while reducing average rollout length by 59.73%.
📝 Abstract
On-policy distillation (OPD) provides dense token-level supervision by asking a teacher to score student-generated rollouts. However, when the student drifts into an unrecoverable prefix, the teacher may locally agree with the degraded state, producing low reverse KL but little corrective training signal. We identify this persistent regime as a low-KL agreement trap. Further analyses show that tokens during and after such traps produce less useful supervision signals. We propose KAT (KL Agreement Trap Termination), an online OPD termination rule that detects persistent low-KL agreement with a dynamic training-adaptive threshold. By filtering weak supervision from degenerate agreement, KAT improves avg@k accuracy by 2.66% and pass@k by 3.43% across four mathematical benchmarks, while reducing average rollout length by 59.73%.
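To make the idea of an online termination rule concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the threshold is computed or how persistence is measured, so the exponential moving average, the `scale` factor, the `patience` window, and all class and function names below are assumptions chosen for illustration. The sketch also makes no attempt to separate healthy agreement from degenerate agreement, which the abstract does not detail either.

```python
import math
from typing import Sequence


def reverse_kl(student_probs: Sequence[float],
               teacher_probs: Sequence[float],
               eps: float = 1e-12) -> float:
    """Per-token reverse KL, KL(student || teacher), over one vocabulary distribution."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )


class LowKLTerminator:
    """Hypothetical detector for persistent low-KL agreement.

    Assumptions (not from the source): the adaptive threshold is a fraction
    `scale` of an exponential moving average of per-token KL values, and a
    rollout is cut once `patience` consecutive tokens fall below it.
    """

    def __init__(self, patience: int = 4, scale: float = 0.5,
                 momentum: float = 0.9, init_mean: float = 1.0) -> None:
        self.patience = patience
        self.scale = scale
        self.momentum = momentum
        self.running_mean = init_mean  # persists across rollouts during training
        self.streak = 0                # consecutive low-KL tokens in this rollout

    @property
    def threshold(self) -> float:
        return self.scale * self.running_mean

    def reset_rollout(self) -> None:
        self.streak = 0

    def step(self, kl: float) -> bool:
        """Consume one token's KL; return True if the rollout should stop."""
        # Compare against the threshold before folding this value into the average.
        self.streak = self.streak + 1 if kl < self.threshold else 0
        self.running_mean = (self.momentum * self.running_mean
                             + (1.0 - self.momentum) * kl)
        return self.streak >= self.patience


def truncation_length(kls: Sequence[float], terminator: LowKLTerminator) -> int:
    """Number of tokens kept before the terminator fires (all of them if it never does)."""
    terminator.reset_rollout()
    for i, kl in enumerate(kls):
        if terminator.step(kl):
            return i + 1
    return len(kls)
```

In an actual training loop, `step` would be called as each token is generated so that decoding can stop early, which is where a reduction in rollout length would come from; `truncation_length` only replays the rule over a recorded KL trace for inspection.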
Problem

Research questions and friction points this paper is trying to address.

on-policy distillation
KL agreement trap
supervision signal
student-teacher learning
sequence generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy distillation
KL agreement trap
KAT
adaptive threshold
token-level supervision