Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing speculative decoding methods rely on serial multi-token prediction, which leads to progressively increasing prediction difficulty and pipeline bubbles, thereby limiting acceleration in low-concurrency scenarios. This work proposes a speculative pipeline decoding framework that deeply integrates model pipeline parallelism into speculative decoding for the first time. By partitioning a large language model into multiple stages that process tokens in parallel, the framework introduces a parallel speculative module strictly synchronized with the main model and incorporates cross-stage intermediate feature aggregation alongside an efficient verification mechanism. This approach achieves bounded prediction complexity, high token acceptance rates, and zero bubble latency, significantly improving theoretical speedup and offering a highly scalable and efficient decoding solution for large language model inference.

📝 Abstract

Speculative Decoding (SD) accelerates low-concurrency LLM inference by employing a draft-then-verify paradigm. However, mainstream methods typically rely on multi-token prediction, which introduces escalating prediction difficulty and serial drafting latency. To address these issues, we propose Speculative Pipeline Decoding (SPD), a groundbreaking framework that unlocks the true potential of pipeline parallelism. By partitioning the target LLM into $n$ pipeline stages, SPD allows the LLM to process $n$ tokens in parallel to accelerate decoding. To keep the pipeline continuously filled during single-sequence decoding, a speculation module aggregates intermediate features across different pipeline depths to predict the next token. The module executes strictly in parallel with the target model's pipeline step, which yields bounded prediction difficulty, higher acceptance rates, and zero latency bubbles. Our experiments demonstrate that SPD achieves a significantly higher theoretical speedup than mainstream baselines, offering a highly scalable solution for LLM decoding acceleration. Our code is available at https://github.com/yuyijiong/speculative_pipeline_decoding
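The scheduling idea in the abstract can be illustrated with a toy step-count model. This is only a sketch under assumptions that the abstract does not state: every stage costs one time unit, the speculation module adds no latency because it runs alongside the pipeline step, and a wrong speculation discards all younger in-flight tokens so the pipeline must refill. The function name `steps_to_decode` and the `draft_ok` callback are hypothetical and are not taken from the authors' repository.

```python
from collections import deque
from typing import Callable


def steps_to_decode(n_stages: int, n_tokens: int,
                    draft_ok: Callable[[int], bool]) -> int:
    """Count pipeline steps needed to commit n_tokens tokens (toy model).

    draft_ok(k) says whether the speculated successor of the k-th committed
    token was correct. Flush-on-reject is an assumption of this sketch.
    """
    steps = 0
    committed = 0
    # Remaining stages for each in-flight token, oldest first.
    in_flight: deque[int] = deque()
    while committed < n_tokens:
        # A new (speculated) token enters the first stage on every step.
        in_flight.append(n_stages)
        # All in-flight tokens advance by one stage.
        in_flight = deque(r - 1 for r in in_flight)
        steps += 1
        if in_flight[0] == 0:
            # The oldest token has left the last stage: its output is committed.
            in_flight.popleft()
            committed += 1
            if not draft_ok(committed):
                # Younger tokens were computed on a wrong guess: discard them.
                in_flight.clear()
    return steps


if __name__ == "__main__":
    n, tokens = 4, 8
    baseline = n * tokens  # no speculation: each token traverses all stages alone
    ideal = steps_to_decode(n, tokens, lambda k: True)
    print(f"baseline={baseline} steps, all drafts accepted={ideal} steps")
```

In this model, perfect speculation needs `n_stages + n_tokens - 1` steps (one pipeline fill, then one token per step), while always-wrong speculation degrades to the unpipelined `n_stages * n_tokens`. Real speedups depend on the measured acceptance rate and on costs this sketch ignores.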

Problem

Research questions and friction points this paper is trying to address.

Speculative Decoding

LLM inference

pipeline parallelism

decoding acceleration

multi-token prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative Decoding

Pipeline Parallelism

LLM Inference Acceleration