AI Summary
To address low throughput and high per-token cost in speculative decoding for large-batch, variable-length-context LLM inference, this paper proposes SPIRe, a draft model that integrates static sparse attention, initialization via KV-cache pruning, and feedback memory. SPIRe jointly optimizes structural sparsity and cache efficiency to improve speculation accuracy and throughput under long contexts, overcoming throughput bottlenecks inherent to small draft models and sparse self-speculation. On representative long-context batched workloads, SPIRe improves throughput by over 100% compared with a much smaller draft model and by more than 35% compared with a sparse self-speculation baseline, substantially reducing per-token inference cost. This work unifies static sparse modeling, cache-aware initialization, and history-informed feedback memory within a single speculative decoding framework for efficient large-batch LLM serving.
Abstract
Speculative decoding (SD) has been shown to reduce the latency of autoregressive decoding (AD) by 2-3x for small batch sizes. However, increasing throughput and therefore reducing the cost per token requires decoding with large batch sizes. Recent work shows that SD can accelerate decoding with large batch sizes too if the context is sufficiently long and the draft model's KV cache is sparse. We introduce SPIRe, a draft model that combines static sparse attention, pruned initialization, and feedback memory to increase the modeled throughput of speculative decoding by over 100% compared to speculation with a much smaller draft model and by over 35% compared to the strong baseline of sparse self-speculation. Our approach is particularly effective when context lengths vary significantly across requests.
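To make the mechanism concrete, below is a minimal greedy speculative-decoding sketch, not SPIRe itself: a cheap draft model proposes k tokens, the target model verifies them and accepts the longest agreeing prefix plus one correction. Both "models" here are hypothetical deterministic toy functions over integer token ids, and verification is shown sequentially for clarity (real implementations score all proposed tokens in one parallel target pass, which is where the speedup comes from).

```python
def draft_next(ctx):
    # Hypothetical cheap draft model: deterministic toy next-token rule.
    return (sum(ctx) * 3 + 1) % 11

def target_next(ctx):
    # Hypothetical target model: agrees with the draft on most contexts,
    # diverges when sum(ctx) is a multiple of 4.
    return (sum(ctx) * 3 + 1) % 11 if sum(ctx) % 4 else (sum(ctx) + 7) % 11

def speculative_step(ctx, k=4):
    """Propose k draft tokens, verify with the target greedily, and
    return the accepted tokens for this decoding step."""
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(ctx + proposal))

    accepted = []
    for tok in proposal:
        t = target_next(ctx + accepted)
        if t == tok:
            accepted.append(t)   # draft agreed with the target: accept
        else:
            accepted.append(t)   # mismatch: take the target's token, stop
            break
    else:
        # All k draft tokens accepted: the verification pass yields one
        # extra target token for free.
        accepted.append(target_next(ctx + accepted))
    return accepted

ctx = [1, 2, 3]
out = speculative_step(ctx)
print(out)  # several tokens per target pass instead of one
```

By construction the accepted tokens match what greedy decoding with the target alone would produce, so the output is unchanged; the gain is that one verification pass can emit up to k+1 tokens. SPIRe's contribution lies in making the draft model itself cheap at large batch sizes and long contexts via sparse attention and a sparse KV cache.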