When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models

📅 2026-03-27
🤖 AI Summary
This work addresses the limitations of conventional perplexity-based evaluation in knowledge distillation for autoregressive generation tasks, which often leads to suboptimal architectures and training strategies. To overcome this, the authors propose a generation-quality-centric distillation paradigm, introducing a Hybrid-KDA architecture and a multi-stage GenDistill distillation pipeline. Through systematic analysis across six design dimensions, they identify data selection, masking strategy, and attention layer freezing as critical factors influencing generative performance. Experimental results demonstrate that the proposed approach preserves 86–90% of the teacher model's knowledge accuracy while reducing KV cache memory usage by 75% and achieving a 2–4× reduction in first-token latency at a context length of 128K tokens.
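The KV cache saving can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the layer count, KV-head count, head dimension, and the choice to keep one attention layer in four are assumptions made for the example, not figures taken from the paper, and the fixed-size recurrent state of the non-attention layers is ignored.

```python
def kv_cache_bytes(num_attn_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Size of the key/value cache for standard softmax-attention layers.

    The leading factor 2 accounts for storing both keys and values.
    """
    return (2 * num_attn_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch)


# Hypothetical configuration (assumed for illustration, 16-bit cache entries).
SEQ_LEN = 128 * 1024       # 128K-token context
TOTAL_LAYERS = 28          # assumed teacher depth
KEPT_ATTN_LAYERS = 7       # assumed: hybrid keeps 1 attention layer in 4

full = kv_cache_bytes(TOTAL_LAYERS, num_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)
hybrid = kv_cache_bytes(KEPT_ATTN_LAYERS, num_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)
reduction = 1 - hybrid / full

print(f"full attention: {full / 2**30:.1f} GiB")
print(f"hybrid:         {hybrid / 2**30:.1f} GiB")
print(f"reduction:      {reduction:.0%}")
```

Under these assumptions the cache scales linearly with the number of retained attention layers, so keeping a quarter of them yields a 75% reduction regardless of context length.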
📝 Abstract
Converting a pretrained Transformer into a more efficient hybrid model through distillation offers a promising approach to reducing inference costs. However, achieving high-quality generation in distilled models requires careful joint design of both the student architecture and the distillation process. Many prior distillation works evaluate downstream multiple-choice benchmarks by ranking candidate answers with log-likelihood rather than requiring autoregressive generation, which can obscure important differences in model quality. For example, we show that a 7B parameter distilled model that nearly matches its teacher to within 0.2 pp under log-likelihood scoring actually falls behind by 20.8 pp when the model must generate answers autoregressively. We propose a Hybrid Kimi Delta Attention (Hybrid-KDA) architecture paired with GenDistill, a multi-stage distillation pipeline, and use generation-based evaluation throughout to guide design decisions. Applying this approach to Qwen3-0.6B, we systematically ablate six design axes: training objective, loss masking, training duration, dataset selection, parameter freezing, and architecture choice. We find that log-likelihood-based evaluation consistently underestimates the gap between teacher and student, and can in some cases reverse the ranking of design choices, meaning that conclusions drawn from perplexity-only evaluation may be misleading. Among the factors we study, dataset selection, completion-only masking, and freezing attention layers during post-training have the largest impact on generation quality. Our best Hybrid-KDA model retains 86–90% of teacher accuracy on knowledge benchmarks while reducing KV cache memory by up to 75% and improving time-to-first-token by 2–4× at 128K-token contexts.
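To make the difference between the two evaluation modes concrete, here is a minimal toy sketch. The lookup-table "model" `TOY_LM`, its probabilities, and the helper names are all hypothetical and exist only for illustration; they are not the paper's models or evaluation harness. Log-likelihood scoring only compares the listed candidates, so it can select the right answer even when the model's own greedy continuation goes elsewhere.

```python
import math

# Hypothetical next-token distributions, keyed by the tokens emitted so far.
TOY_LM = {
    (): {"Paris": 0.35, "Rome": 0.20, "I": 0.45},
    ("I",): {"think": 1.0},
    ("I", "think"): {"Rome": 0.6, "Paris": 0.4},
    ("Paris",): {"<eos>": 1.0},
    ("Rome",): {"<eos>": 1.0},
    ("I", "think", "Rome"): {"<eos>": 1.0},
    ("I", "think", "Paris"): {"<eos>": 1.0},
}


def sequence_logprob(tokens):
    """Sum of log-probabilities the toy model assigns to a candidate answer."""
    total = 0.0
    for i, tok in enumerate(tokens):
        probs = TOY_LM[tuple(tokens[:i])]
        total += math.log(probs.get(tok, 1e-12))  # tiny floor for unseen tokens
    return total


def pick_by_loglikelihood(candidates):
    """Multiple-choice scoring: rank only the given candidates."""
    return max(candidates, key=sequence_logprob)


def greedy_generate(max_tokens=8):
    """Autoregressive evaluation: the model must produce the answer itself."""
    out = []
    for _ in range(max_tokens):
        probs = TOY_LM[tuple(out)]
        tok = max(probs, key=probs.get)
        if tok == "<eos>":
            break
        out.append(tok)
    return out


candidates = [["Paris"], ["Rome"]]          # "Paris" is the correct answer
ll_choice = pick_by_loglikelihood(candidates)
generated = greedy_generate()

print("log-likelihood pick:", ll_choice)
print("greedy generation:  ", generated)
```

In this toy setup the candidate ranking selects "Paris", while greedy decoding starts with a token outside the candidate set and ends on "Rome". This is one simple way a model can look correct under log-likelihood scoring and still answer wrongly when it has to generate.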
Problem

Research questions and friction points this paper is trying to address.

distillation
autoregressive generation
hybrid sequence models
generation quality
log-likelihood evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid-KDA
GenDistill
generation-focused distillation
autoregressive evaluation
KV cache reduction
Juan Gabriel Kostelec
Huawei Zurich Research Center, Switzerland
Xiang Wang
ACS Lab, Huawei Technologies
Axel Laborieux
Research Scientist, Huawei Technologies
Neuromorphic computing, Computational Neuroscience, Learning algorithms
Christos Sourmpis
Huawei Zurich Research Center, Switzerland
Qinghai Guo
ACS Lab, Huawei Technologies