Trajectory-Refined Distillation

📅 2026-06-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the bimodal teacher signal distribution and gradient fragmentation in policy distillation caused by prefix failures. To mitigate these issues at their source, the authors propose a trajectory-level correction method that repairs student-generated trajectories under teacher guidance by fixing erroneous prefixes and injecting alternative valid reasoning paths. This approach enhances exploration while alleviating prefix failure, marking the first effort to elevate distillation intervention from the token level to the trajectory level. It integrates teacher-guided learning within the policy support and introduces privileged information conditioning suitable for self-distillation. Experimental results demonstrate consistent and significant improvements over existing methods across diverse benchmarks and model scales, with notable gains in both single-attempt accuracy and breadth of reasoning coverage.

📝 Abstract

On-policy distillation (OPD) has become a central post-training tool for large language models (LLMs), providing dense per-token teacher supervision along the student's own rollouts. In this work, we identify a common structural cause underlying OPD's failures, which we call prefix failure. Under prefix failure, dense per-token supervision induces a bimodal teacher mixture and fragmented gradients that token-level loss truncation and reweighting fail to address. This observation motivates us to move beyond token-level loss interventions toward trajectory-level output corrections. We thus propose Trajectory-Refined Distillation (TRD), a trajectory-level correction method that revises the student's rollout under teacher guidance while staying within on-policy support. By correcting problematic prefixes before distillation, TRD mitigates prefix failure at its source. Moreover, TRD improves exploration by exposing the student to alternative valid derivations under teacher guidance, even when the original rollouts are already correct. TRD can also be applied to on-policy self-distillation (OPSD), a parameter-sharing variant that uses the student model conditioned on privileged information as the teacher. Across a wide range of benchmarks and base models at multiple scales, TRD consistently outperforms prior baselines, improving single-attempt accuracy and broadening reasoning coverage. Code is available at https://github.com/louieworth/trd
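
To make "dense per-token teacher supervision along the student's own rollouts" concrete, here is a minimal sketch of a per-token distillation signal. It is not taken from the TRD code base. It assumes the per-token loss is a reverse KL divergence, KL(student || teacher), which is a common choice for OPD but is not stated in the abstract. The function names and the toy logits are hypothetical, and the numbers are made up for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at a single token position (assumed loss)."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
    )


def per_token_opd_loss(student_logits, teacher_logits):
    """Return one loss value per position of a student-generated rollout."""
    return [
        reverse_kl(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]


# Toy rollout: 3 positions over a 3-token vocabulary (made-up logits).
# Student and teacher agree at positions 0 and 1 and disagree at position 2.
student_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
teacher_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [2.0, 0.0, 0.0]]

losses = per_token_opd_loss(student_logits, teacher_logits)
for position, loss in enumerate(losses):
    print(f"position {position}: reverse KL = {loss:.4f}")
```

In this toy case the signal is zero where the two models agree and large where they diverge. Every position is scored conditional on whatever prefix the student has already produced, which is the dependence that trajectory-level correction is meant to address.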

Problem

Research questions and friction points this paper is trying to address.

on-policy distillation

prefix failure

trajectory-level correction

dense per-token supervision

fragmented gradients

Innovation

Methods, ideas, or system contributions that make the work stand out.

Trajectory-Refined Distillation

On-policy Distillation

Prefix Failure