🤖 AI Summary
This work addresses a limitation of multi-token prediction (MTP) methods for accelerating inference: the MTP head for the first token competes with the backbone's own language model (LM) head, which leads to repetitive and incoherent outputs and only marginal speedups. The authors propose a Backbone-as-Architect design principle: the backbone LM head always generates the first token, while the MTP heads predict only subsequent tokens. On top of this, a lightweight Collocation-Length Predictor (CLP), a single linear layer with just 4.6K–7.7K parameters, predicts how many additional tokens can be safely accepted at each decoding step. This removes head-backbone competition and enables adaptive multi-token inference with zero quality degradation. On Qwen2.5 (1.5B and 7B), the method achieves 1.14–1.29× inference speedup with a repetition rate below 0.02%, whereas existing gate-based approaches either barely accelerate (1.07×) or produce severely degraded outputs (repetition rate above 0.5%).
📝 Abstract
Large language model inference is bottlenecked by autoregressive decoding, where each token requires a full forward pass. Multi-token prediction (MTP) offers a promising acceleration path, but existing approaches suffer from a fundamental architectural flaw: the MTP head for the first token competes with the backbone's own language model (LM) head, leading to severe quality degradation when its predictions are accepted. We identify this head-backbone competition as the root cause of repetitive and incoherent outputs in prior MTP-based acceleration methods. To address this, we propose Backbone-as-Architect, a design principle in which the backbone LM head always generates the first token and MTP heads are responsible only for subsequent tokens. Building on this principle, we introduce CLP (Collocation-Length Predictor), a lightweight span-level decision layer that predicts how many additional tokens can be safely accepted at each decoding step. CLP uses only a single linear layer (4.6K–7.7K parameters), replacing the over-engineered 1M-parameter gate networks used in prior work. Experiments on Qwen2.5 models (0.5B, 1.5B, 7B) show that CLP achieves 1.20×–1.29× speedup on 1.5B and 1.14×–1.20× on 7B with zero quality degradation (repetition ratio < 0.02), while gate-based approaches either barely accelerate (1.07×) or produce severely degraded outputs (repetition ratio > 0.5%). We further demonstrate that shorter prediction horizons (k=2) achieve 24% higher MTP head accuracy on large models, establishing a scaling-aware design principle. Finally, we identify MTP head prediction accuracy as the binding constraint on acceleration and outline a roadmap for future improvements.
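The decoding rule described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`clp_extra_tokens`, `decode_step`, `linear_param_count`), the use of an argmax over class logits, and the toy weights are all hypothetical. The abstract only states that CLP is a single linear layer that predicts how many additional tokens to accept.

```python
from typing import List, Sequence


def clp_extra_tokens(hidden: Sequence[float],
                     weight: Sequence[Sequence[float]],
                     bias: Sequence[float]) -> int:
    """Single linear layer over a feature vector.

    Assumption: class index c means "accept c additional tokens", and the
    predicted class is the argmax of the logits.
    """
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weight, bias)
    ]
    return max(range(len(logits)), key=lambda c: logits[c])


def decode_step(backbone_token: int,
                mtp_tokens: Sequence[int],
                extra: int) -> List[int]:
    """Backbone-as-Architect: the backbone LM head always supplies the first
    token; MTP proposals are only ever appended after it."""
    extra = max(0, min(extra, len(mtp_tokens)))  # clamp to available proposals
    return [backbone_token] + list(mtp_tokens[:extra])


def linear_param_count(d_in: int, n_classes: int) -> int:
    """Weights plus biases of one linear layer."""
    return d_in * n_classes + n_classes


# Toy example with hypothetical values: 2-dim features, 3 classes (0, 1, 2 extra tokens).
hidden = [1.0, 0.0]
weight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
bias = [0.0, 0.0, 0.0]

extra = clp_extra_tokens(hidden, weight, bias)
tokens = decode_step(backbone_token=10, mtp_tokens=[11, 12, 13], extra=extra)
```

When the predictor outputs 0, the step reduces to ordinary autoregressive decoding, so the backbone's first token is never overridden. As a rough size check, a hypothetical 1536-dimensional input with 3 or 5 output classes gives `linear_param_count(1536, 3) = 4611` and `linear_param_count(1536, 5) = 7685` parameters, the same order as the 4.6K–7.7K quoted above; the actual input dimension and class count are not stated here.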