MicroSpec: Accelerating Speculative Decoding with Lightweight In-Context Vocabularies

📅 2026-04-08

🤖 AI Summary
This work targets the computational overhead of the output projection layer during speculative decoding with large language models. The overhead stems from very large vocabularies (>100k tokens), and existing pruning methods still need to retain around 30k tokens to preserve generation quality. The authors propose a training-free method that exploits the temporal locality of language generation to build a very small (<3k tokens), context-aware active vocabulary at each decoding step. Combined with a co-design of system and algorithmic optimizations for sparse memory access, the approach achieves, for the first time, a minimal dynamic vocabulary without any additional training. Evaluated across seven tasks, it delivers 1.17–1.29× end-to-end speedup over EAGLE-2/3, reduces draft generation time by 51.6%, and shrinks the average vocabulary size by 40×, easing the conventional trade-off between vocabulary size and generation quality.
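To make the scale of the bottleneck concrete, the sketch below counts the multiply-accumulate operations of one output projection for a full and a reduced vocabulary. The hidden size (4096) and the exact vocabulary sizes (120,000 and 3,000) are illustrative assumptions chosen to match the ">100k" and "<3k" ranges quoted above; they are not figures reported by the paper, and the function names are hypothetical.

```python
def projection_macs(hidden_dim: int, vocab_size: int) -> int:
    """Multiply-accumulates for one hidden vector times a (vocab_size x hidden_dim) matrix."""
    return hidden_dim * vocab_size


def reduction_factor(full_vocab: int, active_vocab: int) -> float:
    """How many times fewer output rows are computed; independent of hidden_dim."""
    return full_vocab / active_vocab


# Assumed, illustrative sizes (not taken from the paper).
HIDDEN_DIM = 4096
FULL_VOCAB = 120_000
ACTIVE_VOCAB = 3_000

full_cost = projection_macs(HIDDEN_DIM, FULL_VOCAB)
active_cost = projection_macs(HIDDEN_DIM, ACTIVE_VOCAB)
factor = reduction_factor(FULL_VOCAB, ACTIVE_VOCAB)
```

The projection cost grows linearly with the number of output rows, so the reduction factor equals the ratio of vocabulary sizes. A reduction in the projection alone does not carry over one-to-one to end-to-end latency, because the rest of the draft and verification work is unchanged.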
📝 Abstract
Large language models typically employ vocabularies of over 100k tokens, which creates a major computational bottleneck at the final linear projection layer when performing speculative decoding. Current methods for vocabulary pruning depend on either fixed or coarse-grained sub-vocabularies, requiring around 30k active tokens to preserve the quality of the draft model. We introduce MicroSpec, a training-free technique that overcomes this limitation by building a compact, context-sensitive active vocabulary on the fly for every decoding step. Exploiting the natural temporal locality found in language generation, MicroSpec attains high token coverage while reducing the average vocabulary size by more than 40x (down to under 3k tokens), all without any additional trained parameters. To translate this high sparsity into actual speedups on contemporary hardware, we present a co-designed system and algorithm that mitigates the overhead of sparse memory accesses via asynchronous gathering and GPU-resident state management. Acting as a plug-and-play enhancement, MicroSpec reduces draft inference latency by 51.6% on average, achieving an end-to-end speedup of 1.12-1.32x relative to the leading speculative decoding approach EAGLE-2 on various benchmarks, while also surpassing more sophisticated training-based pruning baselines.
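The idea of a per-step active vocabulary can be illustrated with a minimal pure-Python sketch. It assumes a simple recency heuristic (recently seen tokens plus a small always-on set) as a stand-in for temporal locality; this is not the selection rule, data layout, or GPU implementation used by MicroSpec, and all names (`build_active_vocab`, `sparse_logits`, `draft_next_token`) are hypothetical.

```python
def build_active_vocab(recent_tokens, always_on, budget):
    """Pick up to `budget` token IDs: always-on tokens first, then most recent context tokens."""
    active, seen = [], set()
    for tok in list(always_on) + list(reversed(recent_tokens)):
        if len(active) >= budget:
            break
        if tok not in seen:
            seen.add(tok)
            active.append(tok)
    return active


def sparse_logits(hidden, lm_head, active):
    """Compute logits only for the gathered rows of the output projection."""
    return [sum(h * w for h, w in zip(hidden, lm_head[tok])) for tok in active]


def draft_next_token(hidden, lm_head, active):
    """Greedy draft step over the active vocabulary, returning a full-vocabulary ID."""
    logits = sparse_logits(hidden, lm_head, active)
    best_local = max(range(len(active)), key=logits.__getitem__)
    return active[best_local]  # map local index back to the global token ID


# Toy setup (assumed): 8-token vocabulary, hidden size 2, row t of lm_head is [t, 1].
lm_head = [[float(t), 1.0] for t in range(8)]
hidden = [1.0, 0.0]  # logit of token t is simply t

active = build_active_vocab(recent_tokens=[5, 3, 5, 6], always_on=[0, 1], budget=4)
drafted = draft_next_token(hidden, lm_head, active)
```

In this toy setup the full-vocabulary argmax would be token 7, which is not in the active set, so the draft proposes token 6 instead. In speculative decoding such a miss costs draft acceptance rather than correctness, since the target model verifies every drafted token; this is why token coverage of the active vocabulary matters.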
Problem

Research questions and friction points this paper is trying to address.

speculative decoding
vocabulary pruning
large language models
computational bottleneck
active vocabulary
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
vocabulary pruning
context-aware vocabulary
system-algorithm co-design
training-free acceleration
Zhiyang Chen
Institute for Artificial Intelligence, Peking University, Beijing, China
Daliang Xu
Peking University
mobile computing, system software
Yinyuan Zhang
Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University, Beijing, China
Chenghua Wang
State Key Laboratory of Networking and Switching Technology; School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China
Mengwei Xu
Associate Professor, Beijing University of Posts and Telecommunications
Edge Intelligence, Operating System
Yun Ma
Assistant Professor, Peking University
Web, Mobile Computing, Software Engineering, Service