AI Summary
To address low throughput and high per-token cost in speculative decoding for large-batch, variable-length-context LLM inference, this paper proposes SPIRe, a draft model that integrates static sparse attention, initialization via KV-cache pruning, and feedback memory. SPIRe jointly optimizes structural sparsity and cache efficiency to improve speculation accuracy and throughput under long contexts, overcoming throughput bottlenecks inherent to small draft models and sparse self-speculation. On representative long-context batched workloads, SPIRe improves throughput by over 100% compared with a much smaller draft model and by more than 35% compared with a sparse self-speculation baseline, substantially reducing per-token inference cost. This work unifies static sparse modeling, cache-aware initialization, and history-informed feedback memory within a single speculative decoding framework for efficient large-batch LLM serving.
Abstract
Speculative decoding (SD) has been shown to reduce the latency of autoregressive decoding (AD) by 2-3x for small batch sizes. However, increasing throughput and therefore reducing the cost per token requires decoding with large batch sizes. Recent work shows that SD can accelerate decoding with large batch sizes too if the context is sufficiently long and the draft model's KV cache is sparse. We introduce SPIRe, a draft model that combines static sparse attention, pruned initialization, and feedback memory to increase the modeled throughput of speculative decoding by over 100% compared to speculation with a much smaller draft model and by over 35% compared to the strong baseline of sparse self-speculation. Our approach is particularly effective when context lengths vary significantly across requests.
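To make the mechanism concrete, below is a minimal greedy speculative-decoding sketch, not SPIRe itself: a cheap draft model proposes k tokens, the target model verifies them and accepts the longest agreeing prefix plus one correction. Both "models" here are hypothetical deterministic toy functions over integer token ids, and verification is shown sequentially for clarity (real implementations score all proposed tokens in one parallel target pass, which is where the speedup comes from).

```python
def draft_next(ctx):
    # Hypothetical cheap draft model: deterministic toy next-token rule.
    return (sum(ctx) * 3 + 1) % 11

def target_next(ctx):
    # Hypothetical target model: agrees with the draft on most contexts,
    # diverges when sum(ctx) is a multiple of 4.
    return (sum(ctx) * 3 + 1) % 11 if sum(ctx) % 4 else (sum(ctx) + 7) % 11

def speculative_step(ctx, k=4):
    """Propose k draft tokens, verify with the target greedily, and
    return the accepted tokens for this decoding step."""
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(ctx + proposal))

    accepted = []
    for tok in proposal:
        t = target_next(ctx + accepted)
        if t == tok:
            accepted.append(t)   # draft agreed with the target: accept
        else:
            accepted.append(t)   # mismatch: take the target's token, stop
            break
    else:
        # All k draft tokens accepted: the verification pass yields one
        # extra target token for free.
        accepted.append(target_next(ctx + accepted))
    return accepted

ctx = [1, 2, 3]
out = speculative_step(ctx)
print(out)  # several tokens per target pass instead of one
```

By construction the accepted tokens match what greedy decoding with the target alone would produce, so the output is unchanged; the gain is that one verification pass can emit up to k+1 tokens. SPIRe's contribution lies in making the draft model itself cheap at large batch sizes and long contexts via sparse attention and a sparse KV cache.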