🤖 AI Summary
This work identifies and formalizes a failure mode in on-policy distillation (OPD), the low-KL agreement trap: once a student model drifts into a degraded prefix, the teacher policy may locally agree with that state, so the reverse KL divergence is deceptively low and carries little corrective signal. To mitigate this, the authors propose KAT (KL Agreement Trap Termination), an online rule that monitors reverse KL during rollouts and uses a dynamic, training-adaptive threshold to terminate trajectories stuck in persistent low-KL agreement, filtering out weak supervision. Across four mathematical reasoning benchmarks, KAT improves avg@k accuracy by 2.66% and pass@k by 3.43% while reducing average rollout length by 59.73%.
📝 Abstract
On-policy distillation (OPD) provides dense token-level supervision by asking a teacher to score student-generated rollouts. However, when the student drifts into an unrecoverable prefix, the teacher may locally agree with the degraded state, producing low reverse KL but little corrective training signal. We identify this persistent regime as a low-KL agreement trap. Further analyses show that tokens during and after such traps produce less useful supervision signals. We propose KAT (KL Agreement Trap Termination), an online OPD termination rule that detects persistent low-KL agreement with a dynamic training-adaptive threshold. By filtering weak supervision from degenerate agreement, KAT improves avg@k accuracy by 2.66% and pass@k by 3.43% across four mathematical benchmarks, while reducing average rollout length by 59.73%.
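To make the idea of terminating on persistent low-KL agreement concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not specify how the threshold is computed or how persistence is measured, so everything here is an assumption made for illustration: the class name `LowKLTerminator`, the `patience` window, and the threshold defined as a fraction (`scale`) of an exponential moving average of observed token-level reverse KL. Note that this simple rule cannot by itself distinguish degenerate agreement from genuine agreement on a good prefix.

```python
import math


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """Token-level reverse KL, KL(student || teacher), for one next-token distribution."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )


class LowKLTerminator:
    """Hypothetical stopping rule: stop once reverse KL stays under an
    adaptive threshold for `patience` consecutive tokens.

    The threshold is assumed to be `scale` times an exponential moving
    average of observed KL values; the source does not give the actual form.
    """

    def __init__(self, patience=3, scale=0.5, momentum=0.9, init_mean=1.0):
        self.patience = patience
        self.scale = scale
        self.momentum = momentum
        self.running_mean = init_mean  # persists across rollouts
        self.streak = 0                # consecutive low-KL tokens in this rollout

    @property
    def threshold(self):
        return self.scale * self.running_mean

    def reset(self):
        """Call at the start of each rollout; the running mean is kept."""
        self.streak = 0

    def step(self, kl):
        """Consume one token-level KL value; return True if the rollout should stop."""
        self.streak = self.streak + 1 if kl < self.threshold else 0
        # Update the running statistic after the decision for this token.
        self.running_mean = (
            self.momentum * self.running_mean + (1.0 - self.momentum) * kl
        )
        return self.streak >= self.patience


def truncate_rollout(kls, terminator):
    """Return how many tokens of the rollout are kept under the stopping rule."""
    terminator.reset()
    for i, kl in enumerate(kls):
        if terminator.step(kl):
            return i + 1
    return len(kls)
```

In a training loop, a rule like this would be evaluated as each token is generated, so that a rollout flagged as trapped stops early rather than being scored to full length. The `patience` and `scale` values shown are placeholders, not settings reported in the source.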