AI Summary
In knowledge distillation, supervised methods suffer from a train-inference distribution mismatch, while on-policy approaches yield inaccurate teacher feedback because low-quality student-generated samples fall outside the teacher's familiar distribution. This paper proposes Speculative Knowledge Distillation (SKD), a novel framework in which the student first generates candidate token sequences and the teacher dynamically corrects only low-confidence tokens, enabling high-fidelity knowledge transfer while staying aligned with the student's inference-time distribution. Its core innovation is the first online, token-level, teacher-student collaborative correction mechanism, integrating confidence-driven interleaved sampling, teacher-guided dynamic reweighting, and multi-task joint training. Evaluated across machine translation, summarization, mathematical reasoning, and instruction-following tasks, the method consistently outperforms both supervised and on-policy distillation baselines, and it demonstrates robust performance gains across diverse model scales, data regimes, and initialization strategies.
Abstract
Recent advances in knowledge distillation (KD) have enabled smaller student models to approach the performance of larger teacher models. However, popular methods such as supervised KD and on-policy KD are adversely impacted by the knowledge gap between teacher and student in practical scenarios. Supervised KD suffers from a distribution mismatch between training on a static dataset and inference over the final student-generated outputs. Conversely, on-policy KD, which uses student-generated samples for training, can suffer from low-quality training examples with which teacher models are not familiar, resulting in inaccurate teacher feedback. To address these limitations, we introduce Speculative Knowledge Distillation (SKD), a novel approach that leverages cooperation between student and teacher models to generate high-quality training data on the fly while aligning with the student's inference-time distribution. In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution, transferring high-quality knowledge adaptively. We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following, and show that SKD consistently outperforms existing KD methods across different domains, data sizes, and model initialization strategies.
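The proposal-and-replacement loop described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the function name `skd_sample`, the per-step probability arrays, and the specific acceptance rule (accept the student's token only if it lies in the teacher's top-k) are all illustrative assumptions standing in for whatever criterion and batching the actual method uses.

```python
import numpy as np

def skd_sample(student_probs, teacher_probs, top_k=3, rng=None):
    """Toy sketch of SKD-style interleaved sampling (illustrative, not the paper's code).

    At each step the student proposes a token from its distribution; if the
    proposal is poorly ranked by the teacher (outside the teacher's top-k),
    the teacher replaces it with a sample from its own distribution.
    """
    rng = rng or np.random.default_rng(0)
    tokens = []
    for s_p, t_p in zip(student_probs, teacher_probs):
        proposal = int(rng.choice(len(s_p), p=s_p))      # student proposes a token
        teacher_top_k = np.argsort(t_p)[::-1][:top_k]    # teacher's k best-ranked tokens
        if proposal in teacher_top_k:
            tokens.append(proposal)                      # accept the student's token
        else:
            tokens.append(int(rng.choice(len(t_p), p=t_p)))  # teacher replaces it
    return tokens
```

In this sketch, accepted tokens follow the student's inference-time distribution (addressing the supervised-KD mismatch), while replacements keep the training sequence inside the teacher's high-confidence region (addressing the on-policy feedback problem).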