Trust-Region Behavior Blending for On-Policy Distillation

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a weakness of on-policy distillation: early in training, the student policy generates low-quality prefixes, so teacher supervision is spent on weak contexts. The authors propose Trust-Region Behavior Blending (TRB), a warmup method that samples early rollouts from the closest-to-teacher behavior policy inside a KL trust region centered on the student. The distillation objective remains the per-prefix reverse-KL loss throughout training. The KL budget is annealed to zero, so sampling shifts gradually from the blended behavior policy back to pure student rollouts. Across two mathematical-reasoning distillation settings, TRB attains the strongest average among the compared methods.

📝 Abstract

On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, placing teacher supervision on low-quality prefixes. We propose Trust-Region Behavior Blending (TRB), a warmup method that replaces the early rollout policy with the closest-to-teacher behavior policy inside a student-centered KL trust region, while keeping the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average among the compared methods.
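
The abstract does not spell out how the behavior policy is computed, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that "closest to teacher" means minimizing KL(b || teacher) subject to KL(b || student) <= budget at a single next-token distribution. Under that assumption the solution lies on the geometric path b ∝ student^(1-λ) · teacher^λ, and λ can be found by bisection. The linear annealing schedule and all function names (`trust_region_blend`, `kl_budget`, and so on) are hypothetical choices made for illustration.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for strictly positive categorical distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def geometric_blend(student, teacher, lam):
    """Return b proportional to student^(1-lam) * teacher^lam, normalized."""
    logits = (1.0 - lam) * np.log(student) + lam * np.log(teacher)
    logits -= logits.max()  # subtract the max for numerical stability
    b = np.exp(logits)
    return b / b.sum()

def trust_region_blend(student, teacher, budget, iters=50):
    """Assumed form of the behavior policy: the geometric blend with the
    largest lam in [0, 1] such that KL(b || student) <= budget.
    KL(b || student) is non-decreasing in lam on this path, so bisection works."""
    if budget <= 0.0:
        return student.copy()  # zero budget: pure student rollouts
    if kl(teacher, student) <= budget:
        return teacher.copy()  # teacher already lies inside the trust region
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(geometric_blend(student, teacher, mid), student) <= budget:
            lo = mid
        else:
            hi = mid
    return geometric_blend(student, teacher, lo)

def kl_budget(step, warmup_steps, initial_budget):
    """Hypothetical linear anneal of the KL budget to zero over warmup."""
    return initial_budget * max(0.0, 1.0 - step / warmup_steps)

# Toy next-token distributions over a three-token vocabulary.
student = np.array([0.7, 0.2, 0.1])
teacher = np.array([0.1, 0.3, 0.6])
behavior = trust_region_blend(student, teacher, kl_budget(0, 100, 0.1))
```

In this sketch the student would still be trained with the unchanged per-prefix reverse-KL loss; only the distribution used to sample rollout tokens changes, and once `kl_budget` reaches zero the sampler reduces to the student itself.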

Problem

Research questions and friction points this paper is trying to address.

On-Policy Distillation

prefix mismatch

trust region

behavior policy

knowledge distillation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Trust-Region Behavior Blending

On-Policy Distillation

KL trust region