MicroSpec: Accelerating Speculative Decoding with Lightweight In-Context Vocabularies

📅 2026-04-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets the computational overhead of the output projection layer during speculative decoding in large language models. The overhead stems from very large vocabularies (>100k tokens), and existing pruning methods still need to retain around 30k tokens to preserve generation quality. The authors propose a training-free method that exploits the temporal locality of language generation to build a very small (<3k tokens), context-aware active vocabulary at each decoding step. A co-design of system and algorithmic optimizations reduces the cost of the resulting sparse memory accesses, and the approach achieves, for the first time, a dynamic vocabulary this small without any additional training. Evaluated across seven tasks, it delivers 1.17–1.29× end-to-end speedup over EAGLE-2/3, reduces draft generation time by 51.6%, and shrinks the average vocabulary size by 40×, easing the conventional trade-off between vocabulary size and generation quality.
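
To make the size of the bottleneck concrete, here is a rough back-of-envelope sketch of the output projection cost per drafted token. The hidden size (4096) and the exact vocabulary sizes (128,000 and 3,000) are illustrative assumptions, chosen only to fit the ">100k" and "<3k" ranges quoted above; the helper name `lm_head_macs` is hypothetical.

```python
def lm_head_macs(hidden_size: int, vocab_size: int) -> int:
    """Multiply-accumulates needed to project one hidden vector to vocab logits."""
    return hidden_size * vocab_size


HIDDEN = 4096           # assumed hidden size, not stated in the source
FULL_VOCAB = 128_000    # assumed; the source only says ">100k"
PRUNED_VOCAB = 30_000   # "around 30k" retained by existing pruning methods
ACTIVE_VOCAB = 3_000    # assumed; the source only says "<3k"

full = lm_head_macs(HIDDEN, FULL_VOCAB)
pruned = lm_head_macs(HIDDEN, PRUNED_VOCAB)
active = lm_head_macs(HIDDEN, ACTIVE_VOCAB)

# Cost ratios depend only on the vocabulary sizes, since hidden_size cancels.
reduction_vs_full = full / active      # about 42.7x under these assumptions
reduction_vs_pruned = pruned / active  # 10x under these assumptions
```

Because the cost is linear in the vocabulary size, the ratio is independent of the assumed hidden size. The sketch counts arithmetic only and ignores the memory-access overhead that the system co-design is meant to address.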

📝 Abstract

Large language models typically employ vocabularies of over 100k tokens, which creates a major computational bottleneck at the final linear projection layer when performing speculative decoding. Current methods for vocabulary pruning depend on either fixed or coarse-grained sub-vocabularies, requiring around 30k active tokens to preserve the quality of the draft model. We introduce MicroSpec, a training-free technique that overcomes this limitation by building a compact, context-sensitive active vocabulary on the fly for every decoding step. Exploiting the natural temporal locality found in language generation, MicroSpec attains high token coverage while reducing the average vocabulary size by more than 40x (down to under 3k tokens), all without any additional trained parameters. To translate this high sparsity into actual speedups on contemporary hardware, we present a co-designed system and algorithm that mitigates the overhead of sparse memory accesses via asynchronous gathering and GPU-resident state management. Acting as a plug-and-play enhancement, MicroSpec reduces draft inference latency by 51.6% on average, achieving an end-to-end speedup of 1.12-1.32x relative to the leading speculative decoding approach EAGLE-2 on various benchmarks, while also surpassing more sophisticated training-based pruning baselines.
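
The abstract does not spell out how the active vocabulary is built, so the following is only a minimal sketch of the general idea, under stated assumptions: recently seen token ids are kept in an LRU-style set (one simple way to exploit temporal locality), a small always-active set is added, and logits are computed only for those rows of the output projection. The names `RecencyVocab` and `sparse_argmax`, the LRU policy, and the toy weights are all hypothetical and are not taken from MicroSpec.

```python
from collections import OrderedDict


class RecencyVocab:
    """LRU-style set of recently seen token ids (hypothetical helper)."""

    def __init__(self, capacity, always_active=()):
        self.capacity = capacity
        # Tokens that stay active regardless of recency (assumed design choice).
        self.always_active = list(dict.fromkeys(always_active))
        self._recent = OrderedDict()

    def observe(self, token_ids):
        """Mark tokens as recently seen, evicting the oldest beyond capacity."""
        for t in token_ids:
            self._recent.pop(t, None)
            self._recent[t] = None
            if len(self._recent) > self.capacity:
                self._recent.popitem(last=False)

    def active_ids(self):
        """Return always-active ids followed by recent ids, without duplicates."""
        ids = list(self.always_active)
        seen = set(ids)
        for t in self._recent:
            if t not in seen:
                ids.append(t)
                seen.add(t)
        return ids


def sparse_argmax(hidden, lm_head, active_ids):
    """Score only the active rows of lm_head; return the best full-vocab id."""
    best_id, best_score = None, float("-inf")
    for tid in active_ids:
        score = sum(h * w for h, w in zip(hidden, lm_head[tid]))
        if score > best_score:
            best_id, best_score = tid, score
    return best_id


# Toy output projection: 5 tokens, hidden size 2 (made-up weights).
lm_head = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0], [2.0, 2.0]]

vocab = RecencyVocab(capacity=2, always_active=[0])
vocab.observe([1, 2, 3])          # token 1 is evicted once capacity is exceeded
active = vocab.active_ids()       # [0, 2, 3]

hidden = [0.2, 1.0]
draft_token = sparse_argmax(hidden, lm_head, active)          # 2
full_token = sparse_argmax(hidden, lm_head, range(len(lm_head)))  # 4
```

In the toy run the restricted projection picks token 2 while the full projection would pick token 4, which is not in the active set. This is the coverage risk that any small active vocabulary has to keep low, and in speculative decoding such a miss costs draft acceptance rather than correctness, since the target model still verifies the draft. A real implementation would gather the active rows on the GPU rather than loop in Python.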

Problem

Research questions and friction points this paper is trying to address.

speculative decoding

vocabulary pruning

large language models

computational bottleneck

active vocabulary

Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding

vocabulary pruning

context-aware vocabulary