🤖 AI Summary
This work addresses the sharp decline in draft acceptance rates during speculative decoding with medium- to long-context sequences. The decline is caused by the mismatch between the drafter's sparse key-value (KV) cache and the verifier's full KV cache, and it severely limits acceleration gains. To overcome this, the authors propose BudgetDraft, a unified teacher-student distillation framework that trains a single draft model using multi-view sparse KV training and an acceptance-aware loss. This enables the model to maintain high acceptance rates across varying KV budgets while remaining memory-efficient, without requiring additional inference components. Experiments on PG-19, LongBench, and LWM demonstrate that BudgetDraft achieves up to 6.55×, 4.46×, and 2.10× end-to-end speedup over autoregressive decoding at context lengths of 4K, 8K, and 16K, respectively.
📝 Abstract
Speculative decoding speeds up autoregressive decoding by using a drafter to propose multiple tokens that a verifier validates in parallel. In resource-constrained deployments, the drafter uses a sparse KV cache to limit peak GPU memory and end-to-end latency under a fixed KV budget, while the verifier keeps a full KV cache. Mid-to-long context inference (4K to 16K context length) is common in real applications. However, naive sparse/full speculative decoding suffers from the sparse/full mismatch as context length grows, causing the acceptance rate to drop quickly. We propose BudgetDraft, a multi-view sparse training method for sparse drafting in mid-to-long context inference. The drafter is exposed to multiple sampled KV budgets during training and learns to align each sparse view with one shared full-cache teacher target. BudgetDraft combines an acceptance-aware loss on a full-cache branch with a multi-view loss on a sparse-cache branch, producing a single budget-robust drafter that recovers acceptance across sparsity levels without extra inference-time components. Experimental results on PG-19, LongBench, and LWM show that BudgetDraft achieves up to 6.55x, 4.46x, and 2.10x end-to-end speedup over autoregressive (AR) decoding at 4K, 8K, and 16K context lengths, respectively, while keeping the inference pipeline memory-friendly.
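The draft-then-verify loop and the sparse/full mismatch described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes greedy verification (accept the longest prefix of drafted tokens that matches the verifier's own choices) and models the sparse KV cache as a sliding window over the last `budget` tokens. The names `speculative_step`, `full_model`, and `make_sparse_drafter`, and the modular-sum "models", are hypothetical and exist only to make the example deterministic.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(
    context: List[Token],
    draft_next: NextTokenFn,
    verify_next: NextTokenFn,
    k: int,
) -> Tuple[List[Token], int]:
    """One greedy draft-then-verify round.

    Returns (tokens emitted this round, number of drafted tokens accepted).
    """
    # The drafter proposes k tokens autoregressively.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(draft_next(context + draft))

    # The verifier checks each drafted position. A real system scores all
    # positions in one parallel forward pass; this loop is sequential for clarity.
    accepted: List[Token] = []
    for tok in draft:
        target = verify_next(context + accepted)
        if tok != target:
            # First mismatch: keep the verifier's token and discard the rest.
            return accepted + [target], len(accepted)
        accepted.append(tok)

    # Every drafted token matched, so the verifier adds one bonus token.
    return accepted + [verify_next(context + accepted)], len(accepted)


def full_model(ctx: List[Token]) -> Token:
    """Toy verifier (assumption): the next token depends on the whole context."""
    return sum(ctx) % 7


def make_sparse_drafter(budget: int) -> NextTokenFn:
    """Toy drafter (assumption): same rule, but it sees only the last `budget` tokens."""
    def drafter(ctx: List[Token]) -> Token:
        return sum(ctx[-budget:]) % 7
    return drafter


if __name__ == "__main__":
    context = [1, 2, 3, 4, 5]
    for budget in (100, 2):
        tokens, n_accepted = speculative_step(
            context, make_sparse_drafter(budget), full_model, k=3
        )
        print(f"budget={budget}: emitted={tokens}, accepted={n_accepted}/3")
```

In this toy setup the emitted tokens always agree with what the verifier would produce on its own; only the number of tokens gained per round changes. When the budget covers the whole context, all three drafted tokens are accepted, whereas a budget of 2 leads to a rejection at the first position. That drop in acceptance under a tight budget is the mismatch that BudgetDraft's training is meant to reduce.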