Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing speculative decoding methods treat all tokens in the draft sequence uniformly, overlooking the critical guiding role of early tokens on subsequent generation—leading to low acceptance rates and limited speedup. This work theoretically establishes, for the first time, that early tokens in the draft sequence exhibit higher predictive importance. Building on this insight, we propose a hybrid architecture: a serial Transformer head at the front end precisely models long-range dependencies among early tokens, while a lightweight parallel MLP head at the back end efficiently generates later tokens. A hierarchical computation scheduling strategy coordinates these components. Our design preserves full model compatibility while significantly improving draft quality and acceptance rate. Experiments demonstrate that our method achieves end-to-end inference speedups over state-of-the-art speculative decoding approaches across multiple mainstream LLMs, with average acceleration ratios of 1.3–1.8×.

📝 Abstract
Speculative decoding (SPD) aims to accelerate the auto-regressive token generation process of a target Large Language Model (LLM). Some approaches employ a draft model with multiple heads to predict a sequence of future tokens, where each head handles a token in the sequence. The target LLM verifies the predicted sequence and accepts aligned tokens, enabling efficient multi-token generation. However, existing methods assume that all tokens within a sequence are equally important, employing identical head structures and relying on a single generation paradigm, either serial or parallel. Addressing this, we theoretically demonstrate that initial tokens in the draft sequence are more important than later ones. Building on this insight, we propose Gumiho, a hybrid model combining serial and parallel heads. Specifically, given the critical importance of early tokens, we employ a sophisticated Transformer architecture for the early draft heads in a serial configuration to improve accuracy. For later tokens, we utilize multiple lightweight MLP heads operating in parallel to enhance efficiency. By allocating more advanced model structures and longer running times to the early heads, Gumiho achieves improved overall performance. The experimental results demonstrate that our method outperforms existing approaches, validating its effectiveness.
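The verify-and-accept step described in the abstract can be sketched in a few lines. This is a minimal greedy-verification sketch with toy token IDs, not the paper's implementation: the target model's own greedy continuation is assumed to be precomputed, the longest agreeing draft prefix is accepted, and the first disagreement is replaced by the target's token, so each round emits at least one correct token.

```python
def verify_draft(draft, target_seq):
    """Accept the longest prefix of `draft` that agrees with the target
    model's greedy continuation `target_seq`; the first disagreeing
    position is replaced by the target's own token."""
    out = []
    for d, t in zip(draft, target_seq):
        if d == t:
            out.append(d)          # draft token verified, accept it
        else:
            out.append(t)          # rejected: target's correction ends the round
            break
    # if the whole draft is accepted, real SPD also takes one bonus token
    # from the target's final logits; omitted here for brevity
    return out
```

For example, `verify_draft([5, 7, 9, 2], [5, 7, 3, 2])` accepts the two aligned tokens and substitutes the target's third, returning three tokens from a single target forward pass. The acceptance rate of such drafts is exactly what Gumiho's early-token prioritization aims to raise.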
Problem

Research questions and friction points this paper is trying to address.

Existing SPD methods treat all draft tokens as equally important, using identical structures for every draft head
Drafting follows a single generation paradigm, either fully serial or fully parallel
Uniform treatment overlooks the guiding role of early tokens, lowering acceptance rates and limiting speedup
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid model combining serial and parallel heads
Sophisticated Transformer for early draft heads
Lightweight MLP heads for later tokens
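The serial-then-parallel head layout above can be illustrated with a toy forward pass. This is a hypothetical sketch, not Gumiho's actual architecture: dense layers stand in for the Transformer heads, and the head counts, dimensions, and shared output projection `out_W` are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB = 16, 32

# Illustrative parameters: two serial heads (stand-ins for the Transformer
# stage) and three parallel MLP heads. Shapes and counts are assumptions.
serial_W = [rng.standard_normal((HIDDEN, HIDDEN)) for _ in range(2)]
mlp_W = [rng.standard_normal((HIDDEN, VOCAB)) for _ in range(3)]
out_W = rng.standard_normal((HIDDEN, VOCAB))

def draft_tokens(h):
    """Serial stage: each early head refines the hidden state in turn,
    so later heads condition on earlier heads' work (slower, more accurate).
    Parallel stage: all MLP heads read the final state independently in
    one shot (cheap, lower accuracy for the less critical later tokens)."""
    tokens = []
    for W in serial_W:                       # sequential early heads
        h = np.tanh(h @ W)
        tokens.append(int(np.argmax(h @ out_W)))
    for W in mlp_W:                          # parallel later heads
        tokens.append(int(np.argmax(h @ W)))
    return tokens

print(draft_tokens(rng.standard_normal(HIDDEN)))  # five draft token IDs
```

The design point this sketch mirrors: compute budget is spent where the abstract says it matters, on the early tokens whose correctness gates acceptance of everything after them.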
Jinze Li
Department of Electrical and Electronic Engineering, The University of Hong Kong
Yixing Xu
AMD
Haiduo Huang
Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Xuanwu Yin
Advanced Micro Devices, Inc., Beijing, China
Dong Li
Advanced Micro Devices, Inc., Beijing, China
Edith C.H. Ngai
Department of Electrical and Electronic Engineering, The University of Hong Kong
E. Barsoum
Advanced Micro Devices, Inc., Beijing, China