🤖 AI Summary
In long-context generation, the verification phase of speculative decoding becomes a computational bottleneck. To address this, we propose SpecPV—a lightweight self-speculative decoding method. Its core innovation lies in leveraging partial key-value (KV) cache states for rapid verification, complemented by periodic full-state verification to dynamically bound error accumulation. SpecPV requires no additional training and is compatible with mainstream LLM architectures (e.g., LLaMA-3.1-8B-Instruct, Qwen3), preserving generation accuracy while substantially reducing verification overhead. Experiments demonstrate that SpecPV achieves up to 6× decoding speedup on long-document understanding tasks, with negligible degradation in output quality. By decoupling verification cost from context length and mitigating error propagation without architectural modification, SpecPV effectively alleviates the efficiency bottleneck of speculative decoding in long-context scenarios.
📝 Abstract
Growing demands from tasks like code generation, deep reasoning, and long-document understanding have made long-context generation a crucial capability for large language models (LLMs). Speculative decoding is one of the most direct and effective approaches for accelerating generation. It follows a draft-verify paradigm, where a lightweight draft model proposes several candidate tokens and the target model verifies them. However, we find that as the context length grows, verification becomes the dominant bottleneck. To further accelerate speculative decoding in long-context generation, we introduce SpecPV, a self-speculative decoding approach that performs fast verification using partial key-value (KV) states and periodically applies full verification to eliminate accumulated errors. We validate SpecPV across multiple long-context benchmarks and models, including LLaMA-3.1-8B-Instruct and the Qwen3 series. Experimental results show that SpecPV achieves up to 6× decoding speedup over standard autoregressive decoding with only minor quality degradation.
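The draft-verify loop with partial-KV verification can be illustrated with a toy sketch. Everything below is illustrative, not the authors' implementation: the "target model" is a deterministic scoring rule, partial verification is simulated by truncating the context to a recent window (standing in for a partial KV cache), and `full_every` is a hypothetical parameter controlling how often full-state verification runs to correct accumulated drift.

```python
VOCAB = 8  # toy vocabulary size

def target_next(context):
    # Stand-in for the target model: deterministic rule over the full context.
    return (sum(context) * 31 + len(context)) % VOCAB

def partial_next(context, window):
    # Fast verification proxy: same rule, but over a truncated KV window only.
    return target_next(context[-window:])

def draft_next(context):
    # Cheap draft model: looks at only the last token.
    return (context[-1] * 7 + 1) % VOCAB

def specpv_generate(prompt, n_tokens, k=4, window=16, full_every=8):
    """Toy SpecPV-style loop: draft k tokens, verify with partial KV most
    steps, and fall back to full verification every `full_every` steps."""
    out = list(prompt)
    step = 0
    while len(out) - len(prompt) < n_tokens:
        # Draft phase: propose k candidate tokens autoregressively.
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # Verification phase: accept the longest matching prefix, then
        # substitute the verifier's token at the first mismatch.
        use_full = (step % full_every == 0)
        accepted, ctx = [], list(out)
        for t in draft:
            ref = target_next(ctx) if use_full else partial_next(ctx, window)
            if t == ref:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(ref)
                break
        out.extend(accepted)
        step += 1
    return out[len(prompt):][:n_tokens]
```

With `full_every=1` every step is fully verified, so the output coincides with plain autoregressive decoding by the target rule; larger values trade exactness for cheaper verification, mirroring the partial-versus-full trade-off the paper describes.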