Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding

๐Ÿ“… 2025-09-19
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Autoregressive decoding in large language models (LLMs) incurs high inference latency, and existing early-exit self-speculative decoding (EESD) methods suffer from low draft-token acceptance rates, often resulting in negative speedup. Method: The paper proposes Pipeline-Parallel Self-Speculative Decoding (PPSD), which restructures the model layers into a pipeline so that draft generation and verification overlap: while the final layers verify the current token, the early-exit path concurrently drafts the next one. The design combines early-exit heads, self-speculation, pipeline parallelism, and fine-grained scheduling to eliminate computation wasted on failed speculations. Contribution/Results: PPSD achieves 2.01×–3.81× end-to-end speedup across multiple benchmarks, approaching the theoretical optimum and significantly outperforming state-of-the-art EESD approaches while maintaining output quality.


๐Ÿ“ Abstract
Large language models (LLMs) deliver impressive generation quality but incur very high inference cost because each output token is generated auto-regressively through all model layers. Early-exit based self-speculative decoding (EESD) has emerged to mitigate this cost. In practice, however, many approaches struggle to achieve the expected acceleration in this draft-then-verify paradigm, even with a well-aligned early-exit head and a carefully selected exit position. Our analysis reveals that EESD only pays off when the vast majority of draft tokens are accepted by the LLM; otherwise, the draft cost may outweigh the acceleration gain and lead to a negative speedup. To mitigate this, we propose Pipeline-Parallel Self-Speculative Decoding (PPSD), which fully pipelines the draft and verification work so that no effort is wasted on failed predictions. It has two key innovations. First, we configure the model layers as a pipeline in which early-exit (draft) computations and remaining-layer (verification) computations overlap. Second, we interleave drafting and verification per token: while the LLM is verifying the current token in its final layers, the early-exit path simultaneously drafts the next token. This verify-while-draft scheme keeps all units busy and validates tokens on the fly, analogous to pipelining the speculation and verification stages. Empirical results confirm that PPSD achieves state-of-the-art acceleration in self-speculative LLM inference. On diverse benchmarks, PPSD achieves speedup ratios of 2.01×–3.81×, nearly the optimal acceleration at the fixed acceptance rate and exit position, showcasing its advancement in providing efficient self-speculation.
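The abstract's break-even observation (EESD only pays off at high acceptance rates) can be sketched with a standard speculative-decoding cost model. The function `eesd_speedup`, its parameters, and the geometric acceptance assumption are illustrative, not taken from the paper:

```python
def eesd_speedup(alpha: float, f: float, gamma: int) -> float:
    """Toy cost model for draft-then-verify EESD (assumptions, not the
    paper's analysis).

    alpha -- per-token probability a draft token is accepted (alpha < 1)
    f     -- fraction of layers used by the early-exit draft path
    gamma -- number of draft tokens generated per verification pass
    """
    # Expected tokens committed per cycle under i.i.d. acceptance
    # (gamma drafts plus the one token the verify pass always yields).
    expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    # Cost per cycle: gamma partial (draft) passes plus one full verify pass,
    # in units of one full forward pass.
    cycle_cost = gamma * f + 1.0
    return expected_tokens / cycle_cost
```

With a draft path using 30% of the layers and five-token speculation windows, the model gives a speedup above 1 at 90% acceptance but below 1 (i.e., a slowdown) at 30% acceptance, matching the abstract's point that low acceptance can make EESD counterproductive.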
Problem

Research questions and friction points this paper is trying to address.

High inference costs from autoregressive token generation in large language models
Early-exit speculative decoding fails when draft tokens are frequently rejected
Current approaches waste computation on failed predictions, reducing acceleration gains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pipeline parallelism overlaps draft and verification stages
Interleaves token drafting with on-the-fly verification
Keeps all model units busy via verify-while-draft scheme
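The verify-while-draft idea above can be illustrated with a toy latency model. All timings are assumed for illustration, and `decode_time` is a hypothetical helper, not the paper's implementation:

```python
def decode_time(n_tokens: int, t_draft: float, t_verify: float,
                pipelined: bool) -> float:
    """Toy latency model contrasting serial draft-then-verify with a
    pipelined verify-while-draft schedule (assumed timings).

    t_draft  -- time for the early-exit path to draft one token
    t_verify -- time for the remaining layers to verify one token
    """
    if pipelined:
        # After the first draft fills the pipeline, each new token's draft
        # overlaps with the previous token's verification, so steady-state
        # cost per token is max(t_draft, t_verify), not their sum.
        return t_draft + n_tokens * max(t_draft, t_verify)
    # Serial schedule pays both stages for every token.
    return n_tokens * (t_draft + t_verify)
```

For example, with an early exit at 30% depth (`t_draft=0.3`, `t_verify=0.7`), the pipelined schedule approaches 0.7 time units per token versus 1.0 for the serial schedule, showing how overlap rather than a better draft head recovers the lost speedup.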
๐Ÿ”Ž Similar Papers
No similar papers found.