AI Summary
This work addresses the limitations of conventional perplexity-based evaluation in knowledge distillation for autoregressive generation, which can lead to suboptimal architecture and training choices. The authors propose a generation-quality-centric distillation paradigm, introducing the Hybrid-KDA architecture and GenDistill, a multi-stage distillation pipeline. Through systematic ablation of six design axes, they identify dataset selection, completion-only masking, and freezing attention layers during post-training as the factors with the largest impact on generation quality. The best model retains 86–90% of the teacher's accuracy on knowledge benchmarks while reducing KV cache memory by up to 75% and improving time-to-first-token by 2–4× at 128K-token contexts.
Abstract
Converting a pretrained Transformer into a more efficient hybrid model through distillation offers a promising approach to reducing inference costs. However, achieving high-quality generation in distilled models requires careful joint design of both the student architecture and the distillation process. Many prior distillation works evaluate on downstream multiple-choice benchmarks by ranking candidate answers by log-likelihood rather than requiring autoregressive generation, which can obscure important differences in model quality. For example, we show that a 7B-parameter distilled model that matches its teacher to within 0.2\,pp under log-likelihood scoring falls behind by 20.8\,pp when it must generate answers autoregressively.
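The difference between the two evaluation modes can be illustrated with a toy sketch. Everything here is hypothetical: `TOY_MODEL` is a hand-written lookup table standing in for a language model, the candidates are single tokens, and the function names are invented for illustration. It is not the paper's evaluation harness; it only shows how ranking candidates by log-likelihood can select the correct answer while greedy generation from the same model produces a different one.

```python
import math

# Hypothetical toy "model": maps a token prefix to next-token probabilities.
TOY_MODEL = {
    (): {"A": 0.15, "B": 0.25, "The": 0.60},
    ("The",): {"answer": 1.0},
    ("The", "answer"): {"is": 1.0},
    ("The", "answer", "is"): {"A": 0.7, "B": 0.3},
}

def next_probs(prefix):
    # Unknown prefixes end the sequence.
    return TOY_MODEL.get(tuple(prefix), {"<eos>": 1.0})

def loglik_choice(candidates):
    """Rank single-token candidates by log-probability right after the prompt.

    Nothing is generated: the model is only asked to score the given options.
    """
    scores = {c: math.log(next_probs([]).get(c, 1e-12)) for c in candidates}
    return max(scores, key=scores.get)

def greedy_generate(max_tokens=8):
    """Decode greedily, feeding each chosen token back in as context."""
    out = []
    for _ in range(max_tokens):
        probs = next_probs(out)
        tok = max(probs, key=probs.get)
        if tok == "<eos>":
            break
        out.append(tok)
    return out

def generated_choice(candidates):
    """Extract the last candidate label that appears in the generated text."""
    picked = [t for t in greedy_generate() if t in candidates]
    return picked[-1] if picked else None

if __name__ == "__main__":
    options = ["A", "B"]  # assume "B" is the correct answer
    print("log-likelihood ranking picks:", loglik_choice(options))
    print("greedy generation yields:", greedy_generate())
    print("generation-based answer:", generated_choice(options))
```

In this toy setup the ranking score prefers "B" over "A", yet the most likely generated sequence ends in "A", so the same model would be marked correct under one protocol and wrong under the other.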
We propose a Hybrid Kimi Delta Attention (Hybrid-KDA) architecture paired with GenDistill, a multi-stage distillation pipeline, and use generation-based evaluation throughout to guide design decisions. Applying this approach to Qwen3-0.6B, we systematically ablate six design axes: training objective, loss masking, training duration, dataset selection, parameter freezing, and architecture choice. We find that log-likelihood-based evaluation consistently underestimates the gap between teacher and student, and can in some cases reverse the ranking of design choices, meaning that conclusions drawn from perplexity-only evaluation may be misleading. Among the factors we study, dataset selection, completion-only masking, and freezing attention layers during post-training have the largest impact on generation quality.
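Completion-only masking, one of the ablated axes, can be sketched as follows. This is a minimal illustration under stated assumptions, not the GenDistill implementation: the source does not specify its loss or masking code, the `IGNORE` sentinel and both function names are hypothetical, and the usual one-position shift between inputs and targets is omitted for brevity. The idea shown is only that prompt positions are excluded so the loss is averaged over completion tokens alone.

```python
IGNORE = -100  # assumed sentinel marking positions excluded from the loss

def completion_only_labels(token_ids, prompt_len):
    """Keep labels for completion tokens; mask out the prompt tokens."""
    return [IGNORE] * prompt_len + list(token_ids[prompt_len:])

def masked_nll(token_logprobs, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE.

    token_logprobs[i] is the log-probability the model assigned to the
    target token at position i.
    """
    kept = [-lp for lp, lab in zip(token_logprobs, labels) if lab != IGNORE]
    return sum(kept) / len(kept) if kept else 0.0

if __name__ == "__main__":
    token_ids = [11, 12, 13, 21, 22]           # 3 prompt tokens, 2 completion tokens
    logprobs = [-2.0, -2.0, -2.0, -0.5, -1.5]  # made-up per-token log-probs

    labels = completion_only_labels(token_ids, prompt_len=3)
    print("labels:", labels)
    print("completion-only loss:", masked_nll(logprobs, labels))
    print("full-sequence loss:", masked_nll(logprobs, token_ids))
```

With the made-up numbers above, the full-sequence loss is dominated by the prompt tokens, while the completion-only loss reflects only the tokens the model is expected to generate.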
Our best Hybrid-KDA model retains 86--90\% of teacher accuracy on knowledge benchmarks while reducing KV cache memory by up to 75\% and improving time-to-first-token by 2--4$\times$ at 128K-token contexts.
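To make the KV cache figure concrete, the following back-of-the-envelope sketch shows one way a 75% reduction could arise. All numbers are illustrative assumptions rather than values from the source: a 28-layer model with 8 KV heads of dimension 128 stored in 16-bit precision, and a hypothetical hybrid that keeps full attention in 7 of those layers. The constant-size state of the remaining linear-attention layers is ignored.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of the key/value cache for one sequence.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

if __name__ == "__main__":
    seq_len = 128 * 1024  # 128K-token context

    # Assumed baseline: every layer uses full attention.
    full = kv_cache_bytes(n_attn_layers=28, n_kv_heads=8, head_dim=128, seq_len=seq_len)

    # Assumed hybrid: only 7 of the 28 layers keep a growing KV cache.
    hybrid = kv_cache_bytes(n_attn_layers=7, n_kv_heads=8, head_dim=128, seq_len=seq_len)

    print(f"full attention: {full / 2**30:.1f} GiB")
    print(f"hybrid:         {hybrid / 2**30:.1f} GiB")
    print(f"reduction:      {1 - hybrid / full:.0%}")
```

Because the cache grows linearly in the number of attention layers, replacing three out of every four of them removes three quarters of the cache regardless of context length; the actual layer ratio used by Hybrid-KDA is not stated in this section.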