When Should the Teacher Move? Temporal Coupling and Stability in Self On-Policy Distillation

📅 2026-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the instability and state-oblivious collapse observed in self on-policy distillation, which stem from ill-timed updates of the teacher policy. Through a systematic analysis of the temporal coupling between the teacher's update schedule and the student's learning dynamics, the study shows that isolation periods (intervals in which the teacher is completely frozen between updates) are what enable stable learning, and it identifies clock-driven teacher refreshes as the cause of collapse under long-horizon training. To mitigate this, the authors propose Consolidation-Gated Teacher Refresh (CGTR), an adaptive gating mechanism that preserves isolation periods and triggers a teacher update only on joint evidence of reward improvement and length-tail safety. With a single shared parameter set and no per-dataset retuning, CGTR achieves zero collapse and the best final score on all four tasks (Chemistry, Biology, Physics, and ToolUse), self-regulating its refresh frequency to each task's learning dynamics.
📝 Abstract
Self on-policy distillation trains a student policy against a teacher derived from its own parameter history, yet the teacher's update schedule, which governs the temporal coupling between teacher and student, has not been systematically studied as a stability variable. Through a controlled schedule sweep on Qwen3-8B, we establish that isolation periods, defined as complete teacher freezing between updates, are the key structural property enabling stable learning, not teacher age. To characterize these underlying training dynamics, we introduce a diagnostic framework of temporal KL structure, refresh shock, and length-tail risk. This framework further uncovers state-oblivious collapse: optimal short-horizon fixed schedules catastrophically fail under long-horizon training because a clock-driven refresh can copy a transiently drifting student into the teacher in a single, irreversible step. This failure mode is invisible under short-horizon evaluation and mechanistically distinct from EMA's chronic contamination. To address this, we propose Consolidation-Gated Teacher Refresh (CGTR), which preserves isolation periods while gating each refresh on joint evidence of reward improvement and length-tail safety, ensuring every teacher movement responds to genuine student consolidation rather than a clock signal. With a single shared parameter set and no per-dataset retuning, CGTR achieves zero collapse and the best final score on all four tasks (Chemistry, Biology, Physics, ToolUse), self-regulating its refresh frequency to each task's learning dynamics.
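The gating idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the actual criteria, so the class name `ConsolidationGate`, the helper `length_tail`, the smoothing window, the 95th-percentile tail statistic, and every threshold below are hypothetical choices made only to show how a refresh might be gated on joint reward and length-tail evidence while the teacher stays frozen in between.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Sequence


def length_tail(lengths: Sequence[int], q: float = 0.95) -> float:
    """Nearest-rank q-quantile of response lengths (hypothetical tail statistic)."""
    ordered = sorted(lengths)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return float(ordered[idx])


@dataclass
class ConsolidationGate:
    """Illustrative refresh gate; all defaults are assumptions, not paper values."""
    min_isolation_steps: int = 50     # teacher stays frozen at least this long
    min_reward_gain: float = 0.01     # required gain over the last refresh
    max_tail_length: float = 4096.0   # length-tail safety bound
    window: int = 20                  # smoothing window, in student steps
    reward_at_last_refresh: float = float("-inf")
    steps_since_refresh: int = 0
    rewards: List[float] = field(default_factory=list)
    tails: List[float] = field(default_factory=list)

    def observe(self, batch_reward: float, batch_lengths: Sequence[int]) -> bool:
        """Record one student step; return True if the teacher may refresh now."""
        self.steps_since_refresh += 1
        self.rewards = (self.rewards + [batch_reward])[-self.window:]
        self.tails = (self.tails + [length_tail(batch_lengths)])[-self.window:]

        # Isolation period: no teacher movement, whatever the metrics say.
        if self.steps_since_refresh < self.min_isolation_steps:
            return False
        if len(self.rewards) < self.window:
            return False

        smoothed = mean(self.rewards)
        reward_ok = smoothed >= self.reward_at_last_refresh + self.min_reward_gain
        tail_ok = max(self.tails) <= self.max_tail_length

        # Refresh only on joint evidence; otherwise keep waiting.
        if reward_ok and tail_ok:
            self.reward_at_last_refresh = smoothed
            self.steps_since_refresh = 0
            return True
        return False
```

In a training loop, a `True` return would be the only point at which student weights are copied into the teacher. A clock-driven schedule, by contrast, would copy whenever the step counter reaches a fixed interval, regardless of whether the student is transiently drifting.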
Problem

Research questions and friction points this paper is trying to address.

temporal coupling
teacher-student stability
isolation periods
state-oblivious collapse
self-distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal coupling
isolation periods
state-oblivious collapse
Consolidation-Gated Teacher Refresh
self on-policy distillation