Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance

📅 2026-05-29
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing on-policy distillation (OPD) methods rely on token-level supervision, which cannot reliably distinguish genuine reasoning divergences from surface-form differences, so student reasoning paths remain poorly aligned with the teacher's. This work proposes Trajectory-aware On-Policy Distillation (TOPD), which uses near-future trajectory information to identify truly divergent states and spreads supervision from a single token across multiple future tokens, moving toward trajectory-level alignment. By combining reverse-KL correction with detection of short-horizon distributional drift, TOPD raises average accuracy over standard OPD from 47.8% to 52.2%. It reaches 63.3% on AIME24 (up 3.3 percentage points from 60.0%) and 53.3% on AIME25 (up 6.6 percentage points from 46.7%).
📝 Abstract
On-Policy Distillation (OPD) improves large language model reasoning by training a student model on trajectories sampled from its own policy under teacher supervision. Although OPD operates on trajectories, its learning signal remains token-level: it identifies deviations through high-loss tokens and repairs them through local reverse-KL correction. We show that this "trajectory-sampled but token-learned" mechanism cannot reliably bridge student trajectories toward teacher trajectories. About 30% of high-loss tokens fall into the low-divergence regime, indicating that many are surface-form mismatches rather than real reasoning forks. Moreover, even truly divergent tokens are difficult to repair with isolated token-level supervision, since reasoning failures often unfold as short-horizon distributional drift. We propose Trajectory-aware OPD (TOPD), which uses near-future trajectory information to identify real divergent states and distribute guidance across multiple future tokens. Experiments show that suppressing non-divergent high-loss tokens improves standard OPD from 47.8% to 48.2% average accuracy, while TOPD further improves performance to 52.2%, with gains on AIME24 from 60.0% to 63.3% and AIME25 from 46.7% to 53.3%.
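As a rough illustration of the idea in the abstract (not the authors' implementation), the sketch below computes a per-token reverse KL and then keeps a high-loss token only if the tokens immediately after it also show elevated divergence. The function names, the use of a mean over a fixed window as the near-future score, and both thresholds are assumptions made for this example; the paper's actual divergence measure and weighting scheme are not given in this summary.

```python
import math

def reverse_kl(student, teacher, eps=1e-12):
    """Reverse KL, KL(student || teacher), for two probability vectors."""
    return sum(
        s * math.log((s + eps) / (t + eps))
        for s, t in zip(student, teacher)
        if s > 0
    )

def near_future_divergence(per_token_kl, t, horizon):
    """Hypothetical drift score: mean reverse KL over the next `horizon` positions."""
    window = per_token_kl[t + 1 : t + 1 + horizon]
    return sum(window) / len(window) if window else 0.0

def select_divergent(per_token_kl, loss_threshold, drift_threshold, horizon):
    """Keep high-loss positions whose near future also drifts (assumed criterion)."""
    return [
        t
        for t, kl in enumerate(per_token_kl)
        if kl >= loss_threshold
        and near_future_divergence(per_token_kl, t, horizon) >= drift_threshold
    ]

# Toy per-token reverse-KL values along one sampled student trajectory.
per_token_kl = [0.01, 0.9, 0.02, 0.01, 0.8, 0.7, 0.6, 0.05]
selected = select_divergent(per_token_kl, loss_threshold=0.5,
                            drift_threshold=0.3, horizon=2)
print(selected)  # [4, 5]
```

In this toy trajectory, position 1 has a high loss but the next two positions are close to the teacher, so it is treated as a surface-form mismatch and dropped. Positions 4 and 5 are followed by sustained divergence and are kept as candidate reasoning forks.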
Problem

Research questions and friction points this paper is trying to address.

On-Policy Distillation
reasoning trajectories
token-level supervision
distributional drift
trajectory alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

On-Policy Distillation
trajectory-aware learning
near-future guidance
reasoning alignment
distributional drift