🤖 AI Summary
This work addresses a weakness of on-policy distillation: early in training, the student policy generates low-quality prefixes, so teacher supervision is spent on weak contexts. The authors propose Trust-Region behavior Blending (TRB), a warmup method that samples early rollouts from the behavior policy closest to the teacher within a KL trust region centered on the student. The distillation objective, the per-prefix reverse-KL loss, is unchanged throughout training. The KL budget is annealed to zero, so sampling gradually shifts from the blended behavior policy back to pure student rollouts. Across two math-reasoning distillation settings, TRB attains the strongest average among the compared methods.
📝 Abstract
On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, placing teacher supervision on low-quality prefixes. We propose Trust-Region behavior Blending (TRB), a warmup method that replaces the early rollout policy with the closest-to-teacher behavior policy inside a student-centered KL trust region, while keeping the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average among the compared methods.
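To make the trust-region idea concrete, here is a minimal per-token sketch in Python with NumPy. It is not the authors' implementation, and several choices are assumptions the abstract does not confirm: "closest to the teacher" is taken to mean minimizing KL(q ‖ teacher) subject to KL(q ‖ student) ≤ budget, which is solved by a geometric interpolation between the two distributions; the interpolation weight is found by bisection; and the budget is annealed linearly. The function names (`trust_region_behavior`, `kl_budget`) and all parameters are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    return z - np.log(np.sum(np.exp(z)))


def kl(log_q, log_p):
    """KL(q || p) for two normalized log-probability vectors."""
    q = np.exp(log_q)
    return float(np.sum(q * (log_q - log_p)))


def blend(log_s, log_t, lam):
    """Geometric interpolation: q proportional to student^(1-lam) * teacher^lam."""
    return log_softmax((1.0 - lam) * log_s + lam * log_t)


def trust_region_behavior(log_s, log_t, budget, iters=50):
    """Assumed behavior policy: the blend nearest the teacher whose
    KL(q || student) stays within `budget` (hypothetical formulation)."""
    if budget <= 0.0:
        return log_s.copy()  # zero budget: pure student rollouts
    if kl(log_t, log_s) <= budget:
        return log_t.copy()  # teacher already lies inside the trust region
    lo, hi = 0.0, 1.0
    # KL(blend || student) is non-decreasing in lam, so bisection applies.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(blend(log_s, log_t, mid), log_s) <= budget:
            lo = mid
        else:
            hi = mid
    return blend(log_s, log_t, lo)


def kl_budget(step, warmup_steps, initial_budget):
    """Linear annealing of the KL budget to zero (the schedule shape is an assumption)."""
    frac = min(step / warmup_steps, 1.0)
    return initial_budget * (1.0 - frac)


# Toy next-token distributions over a 3-token vocabulary.
log_student = log_softmax(np.array([2.0, 0.0, -1.0]))
log_teacher = log_softmax(np.array([0.0, 2.0, 1.0]))

# Behavior policy at the start of a hypothetical 100-step warmup.
log_behavior = trust_region_behavior(
    log_student, log_teacher, kl_budget(0, 100, 0.1)
)
```

In this sketch only the sampling distribution changes: tokens would be drawn from `log_behavior` during warmup, while the loss on each sampled prefix remains the reverse KL between student and teacher. Once `kl_budget` reaches zero, `trust_region_behavior` returns the student distribution unchanged.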