When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models

📅 2026-03-27

🤖 AI Summary

This work shows that conventional perplexity-based (log-likelihood) evaluation can misrepresent the quality of distilled models on autoregressive generation tasks, which can lead to suboptimal choices of architecture and training strategy. The authors instead propose a generation-focused distillation approach that pairs a Hybrid Kimi Delta Attention (Hybrid-KDA) architecture with GenDistill, a multi-stage distillation pipeline. A systematic ablation of six design axes on Qwen3-0.6B identifies dataset selection, completion-only loss masking, and freezing attention layers during post-training as the factors with the largest impact on generation quality. The best Hybrid-KDA model retains 86–90% of teacher accuracy on knowledge benchmarks while reducing KV cache memory by up to 75% and improving time-to-first-token by 2–4× at 128K-token contexts.
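
To see where a KV cache saving of this size can come from, the sketch below works through the standard cache-size arithmetic. The layer count, KV head count, head dimension, 16-bit cache precision, and the assumption that the student keeps one quarter of the attention layers are all illustrative values, not figures from the paper; the fixed-size recurrent state of the linear-attention layers is ignored.

```python
def kv_cache_bytes(num_attn_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of a standard KV cache: keys and values for every attention layer."""
    # Factor 2 accounts for storing both the key tensor and the value tensor.
    return 2 * num_attn_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


SEQ_LEN = 128 * 1024  # 128K-token context

# Assumed teacher shape (illustrative): 28 attention layers, 8 KV heads, head_dim 128.
teacher_bytes = kv_cache_bytes(28, 8, 128, SEQ_LEN)

# Assumed student (illustrative): only 7 of the 28 layers remain softmax attention.
student_bytes = kv_cache_bytes(7, 8, 128, SEQ_LEN)

reduction = 1 - student_bytes / teacher_bytes
print(f"teacher: {teacher_bytes / 2**30:.1f} GiB")
print(f"student: {student_bytes / 2**30:.1f} GiB")
print(f"reduction: {reduction:.0%}")
```

Under these assumptions the cache scales linearly with the number of layers that still use softmax attention, so keeping a quarter of them removes three quarters of the cache.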

📝 Abstract

Converting a pretrained Transformer into a more efficient hybrid model through distillation offers a promising approach to reducing inference costs. However, achieving high-quality generation in distilled models requires careful joint design of both the student architecture and the distillation process. Many prior distillation works evaluate downstream multiple-choice benchmarks by ranking candidate answers with log-likelihood rather than requiring autoregressive generation, which can obscure important differences in model quality. For example, we show that a 7B parameter distilled model that nearly matches its teacher to within 0.2 pp under log-likelihood scoring actually falls behind by 20.8 pp when the model must generate answers autoregressively. We propose a Hybrid Kimi Delta Attention (Hybrid-KDA) architecture paired with GenDistill, a multi-stage distillation pipeline, and use generation-based evaluation throughout to guide design decisions. Applying this approach to Qwen3-0.6B, we systematically ablate six design axes: training objective, loss masking, training duration, dataset selection, parameter freezing, and architecture choice. We find that log-likelihood-based evaluation consistently underestimates the gap between teacher and student, and can in some cases reverse the ranking of design choices, meaning that conclusions drawn from perplexity-only evaluation may be misleading. Among the factors we study, dataset selection, completion-only masking, and freezing attention layers during post-training have the largest impact on generation quality. Our best Hybrid-KDA model retains 86–90% of teacher accuracy on knowledge benchmarks while reducing KV cache memory by up to 75% and improving time-to-first-token by 2–4× at 128K-token contexts.
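
The difference between the two evaluation modes described above can be illustrated with a toy example. The sketch below is not the paper's evaluation code: `TOY_MODEL`, its probabilities, and the helper names are hypothetical, chosen so that ranking candidates by log-likelihood picks one answer while greedy autoregressive decoding drifts to another.

```python
import math

# Hypothetical next-token distributions keyed by prefix; numbers are made up.
TOY_MODEL = {
    ("Q",): {"A": 0.10, "B": 0.30, "C": 0.05, "D": 0.05, "Hmm": 0.50},
    ("Q", "Hmm"): {"A": 0.40, "B": 0.20, "C": 0.20, "D": 0.10, "<eos>": 0.10},
}


def next_token_probs(prefix):
    # Unknown prefixes end the sequence immediately.
    return TOY_MODEL.get(tuple(prefix), {"<eos>": 1.0})


def loglik_score(prompt, continuation):
    """Sum of log-probabilities of `continuation` given `prompt`."""
    prefix, total = list(prompt), 0.0
    for tok in continuation:
        p = next_token_probs(prefix).get(tok, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
        prefix.append(tok)
    return total


def rank_by_loglik(prompt, candidates):
    """Log-likelihood scoring: only the listed candidates compete."""
    return max(candidates, key=lambda c: loglik_score(prompt, c))


def greedy_generate(prompt, max_new_tokens=4):
    """Generation scoring: the model must produce the answer on its own."""
    prefix, out = list(prompt), []
    for _ in range(max_new_tokens):
        probs = next_token_probs(prefix)
        tok = max(probs, key=probs.get)
        if tok == "<eos>":
            break
        out.append(tok)
        prefix.append(tok)
    return out


candidates = [["A"], ["B"], ["C"], ["D"]]
print(rank_by_loglik(["Q"], candidates))  # best-scoring candidate
print(greedy_generate(["Q"]))             # what the model actually emits
```

In this toy setup, log-likelihood ranking never sees the high-probability non-answer token `Hmm`, so it reports `B`, whereas greedy decoding follows `Hmm` and ends on `A`. If `B` is the gold answer, the model is scored correct under ranking and wrong under generation, which is the kind of gap the abstract describes.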

Problem

Research questions and friction points this paper is trying to address.

distillation

autoregressive generation

hybrid sequence models

generation quality

log-likelihood evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid-KDA

GenDistill

generation-focused distillation
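
Completion-only masking, one of the factors the abstract highlights, is commonly implemented by excluding prompt tokens from the training loss. The following is a minimal sketch of that general idea, not the GenDistill implementation; the function names and the per-token loss values are hypothetical.

```python
def completion_only_mask(prompt_len, total_len):
    """1 for completion tokens, 0 for prompt tokens."""
    return [0 if i < prompt_len else 1 for i in range(total_len)]


def masked_mean(per_token_loss, mask):
    """Average the loss over unmasked positions only."""
    kept = sum(mask)
    if kept == 0:
        raise ValueError("mask removes every token")
    return sum(loss * m for loss, m in zip(per_token_loss, mask)) / kept


# Hypothetical per-token distillation losses: 2 prompt tokens, 2 completion tokens.
per_token_loss = [2.0, 2.0, 0.5, 1.5]
mask = completion_only_mask(prompt_len=2, total_len=len(per_token_loss))

full_loss = sum(per_token_loss) / len(per_token_loss)
completion_loss = masked_mean(per_token_loss, mask)
print(mask, full_loss, completion_loss)
```

With the mask applied, the training signal comes only from the tokens the student has to generate, so the loss is not diluted by positions where it merely reads the prompt.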