SPIRe: Boosting LLM Inference Throughput with Speculative Decoding

📅 2025-04-08
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the low throughput and high per-token cost of speculative decoding in large-batch LLM inference with variable-length contexts, this paper proposes SPIRe, a draft model that combines static sparse attention, pruned initialization, and feedback memory. By keeping the draft model's attention and KV cache sparse while preserving speculation accuracy, SPIRe avoids the throughput bottlenecks of very small draft models and of sparse self-speculation. On long-context batched workloads, SPIRe increases the modeled throughput of speculative decoding by over 100% compared to speculation with a much smaller draft model and by over 35% compared to the strong sparse self-speculation baseline, substantially reducing the cost per generated token. The approach is particularly effective when context lengths vary significantly across requests.
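For readers unfamiliar with speculative decoding itself, the sketch below shows the basic draft-and-verify loop that a draft model such as SPIRe plugs into. It is a minimal, greedy-verification illustration in Python; the `draft_model` and `target_model` callables, the lookahead `k`, and the token-level details are illustrative assumptions, not the paper's implementation.

```python
# Minimal greedy speculative decoding loop (illustrative sketch only).
# Assumptions: draft_model(prefix) and target_model(prefix) each return the
# argmax next token; real systems work with logits, sampling, batching,
# and KV caches.

def speculative_decode(prefix, draft_model, target_model, k=4, max_new_tokens=64):
    """Generate tokens by drafting k tokens cheaply, then verifying them
    with the expensive target model."""
    tokens = list(prefix)
    while len(tokens) < len(prefix) + max_new_tokens:
        # 1) Draft: the cheap model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))

        # 2) Verify: the target model scores all k positions; a real system
        #    does this in one parallel pass, emulated here position by position.
        verified = [target_model(tokens + draft[:i]) for i in range(k)]

        # 3) Accept the longest prefix of the draft that the target agrees with,
        #    then append the target's own token at the first disagreement.
        n_accepted = 0
        for d, v in zip(draft, verified):
            if d == v:
                n_accepted += 1
            else:
                break
        tokens.extend(draft[:n_accepted])
        if n_accepted < k:
            tokens.append(verified[n_accepted])  # target's correction
    return tokens
```

Throughput improves when the drafted tokens are cheap to produce and frequently accepted; SPIRe's contribution is a draft model whose cost stays low at large batch sizes and long contexts.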

📝 Abstract
Speculative decoding (SD) has been shown to reduce the latency of autoregressive decoding (AD) by 2-3x for small batch sizes. However, increasing throughput and therefore reducing the cost per token requires decoding with large batch sizes. Recent work shows that SD can accelerate decoding with large batch sizes too if the context is sufficiently long and the draft model's KV cache is sparse. We introduce SPIRe, a draft model that combines static sparse attention, pruned initialization, and feedback memory to increase the modeled throughput of speculative decoding by over 100% compared to speculation with a much smaller draft model and by over 35% compared to the strong baseline of sparse self-speculation. Our approach is particularly effective when context lengths vary significantly across requests.
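The "modeled throughput" comparisons in the abstract can be grounded with a simple cost model. The sketch below is a back-of-envelope calculation under assumed costs; the cost ratios, acceptance rate, and lookahead length are hypothetical numbers chosen for illustration, not figures from the paper.

```python
# Back-of-envelope modeled throughput for speculative decoding (illustrative).
# All numbers below are hypothetical; the paper's model accounts for batch
# size, context length, and KV-cache sparsity in more detail.

def modeled_throughput(t_target, t_draft, k, acceptance_rate):
    """Tokens generated per unit time for one draft-and-verify cycle.

    t_target: cost of one verification pass of the target model
    t_draft:  cost of one draft-model step
    k:        number of drafted tokens per cycle
    acceptance_rate: probability that each drafted token is accepted
    """
    p = acceptance_rate
    # Expected accepted tokens per cycle (truncated geometric sum), plus the
    # one token the target model contributes itself.
    expected_accepted = sum(p**i for i in range(1, k + 1))
    tokens_per_cycle = expected_accepted + 1
    time_per_cycle = k * t_draft + t_target
    return tokens_per_cycle / time_per_cycle

baseline = 1.0 / 1.0                                  # autoregressive: 1 token per target pass
tiny_draft = modeled_throughput(1.0, 0.02, 4, 0.6)    # very small draft, lower acceptance
sparse_draft = modeled_throughput(1.0, 0.10, 4, 0.8)  # larger but sparse draft, higher acceptance
print(baseline, tiny_draft, sparse_draft)
```

Under such a model, a draft that is somewhat more expensive per step but attains a higher acceptance rate and cheaper attention at long contexts can yield higher end-to-end throughput, which is the trade-off SPIRe targets.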
Problem

Research questions and friction points this paper is trying to address.

Low throughput and high cost per token when decoding with large batch sizes
Limited speedup from existing speculative decoding approaches outside small-batch settings
Degraded efficiency when context lengths vary significantly across requests
Innovation

Methods, ideas, or system contributions that make the work stand out.

Employs static sparse attention in the draft model for efficiency (see the sketch after this list)
Uses pruned initialization to reduce overhead
Implements feedback memory for dynamic adaptation
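To make "static sparse attention" concrete, the sketch below builds one common fixed sparsity pattern, a sliding window plus a few globally visible "sink" tokens, as a boolean attention mask. The window size, number of sink tokens, and the numpy formulation are illustrative assumptions; the paper's exact sparsity pattern may differ.

```python
import numpy as np

def static_sparse_mask(seq_len, window=128, num_sinks=4):
    """Boolean attention mask: True where a query may attend to a key.

    The pattern is *static* (fixed in advance, independent of the tokens):
    each position attends to the first `num_sinks` tokens and to the last
    `window` positions before it, combined with causal masking. The draft
    model's KV cache then only needs to hold the sink and window keys.
    """
    q = np.arange(seq_len)[:, None]   # query positions
    k = np.arange(seq_len)[None, :]   # key positions
    causal = k <= q
    local = (q - k) < window          # within the sliding window
    sink = k < num_sinks              # always-visible initial tokens
    return causal & (local | sink)

mask = static_sparse_mask(seq_len=1024)
kept = mask.sum() / (1024 * 1025 / 2)  # fraction of causal entries retained
print(f"fraction of causal attention retained: {kept:.2%}")
```

Because the number of keys each query attends to is bounded, the draft model's attention cost and KV-cache footprint grow far more slowly with context length than the target model's, which is what keeps speculation profitable at large batch sizes and long contexts.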
Authors
Sanjit Neelam
Daniel Heinlein · Aalto University (subspace coding, coding theory, groups, integer linear programming, complexity theory)
Vaclav Cvicek
Akshay Mishra
Reiner Pope