D^2SD: Accelerating Speculative Decoding with Dual Diffusion Draft Models

📅 2026-06-03
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the limited acceptance rate of existing diffusion-based speculative decoding, which commits to a single draft sequence per verification and discards every draft token after the first mismatch. The authors propose a dual-diffusion draft framework. In the first stage, a diffusion drafter generates a draft block together with per-position confidence scores, which are used to identify the most likely rejection boundary and to select the top-K prefixes for recovery. In the second stage, a variable-prefix diffusion drafter re-anchors at each selected prefix and produces alternative continuations in one batched pass. The resulting shared-prefix candidates form a confidence-guided prefix tree and are verified jointly via cascade attention. The authors report improvements over both the underlying diffusion approach and strong autoregressive speculative decoding baselines across multiple benchmarks, in acceptance rate, inference speedup, and throughput.
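The single-draft limitation described above can be illustrated with a minimal sketch of greedy verification. This is not the paper's code: the function names are hypothetical, tokens are plain integers, and `target` is assumed to hold the target model's greedy prediction at each draft position plus one extra position.

```python
from typing import List, Sequence


def accepted_prefix_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count leading draft tokens that match the target's greedy tokens."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break  # first mismatch: everything after it is discarded
        n += 1
    return n


def verify_single_draft(draft: Sequence[int], target: Sequence[int]) -> List[int]:
    """Return the tokens committed after one verification pass.

    Assumption: len(target) == len(draft) + 1, so the target always
    supplies one token of its own (a correction or a bonus token).
    """
    n = accepted_prefix_length(draft, target)
    committed = list(draft[:n])
    if n < len(target):
        committed.append(target[n])
    return committed


if __name__ == "__main__":
    # The draft diverges at position 2, so the matching token at
    # position 3 is thrown away along with the mismatch.
    print(verify_single_draft([5, 7, 9, 2], [5, 7, 1, 2, 8]))
```

In this example the draft token at position 3 agrees with the target, yet it is still discarded because the mismatch at position 2 comes first. That wasted work is what branching from alternative prefixes aims to recover.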
📝 Abstract
Speculative decoding accelerates autoregressive large language model inference by drafting multiple tokens and verifying them in a single target-model forward pass. Recent diffusion-based drafters generate an entire block of tokens in parallel but usually commit to a single draft sequence per verification: once the first mismatch occurs, all subsequent draft tokens are discarded, resulting in a limited acceptance rate. Naively batching more draft candidate sequences only introduces a marginal improvement, as redundant or poorly placed branches increase the cost of drafting and verification without proportionally increasing the number of accepted tokens. We propose D^2SD, a dual diffusion draft speculative decoding framework that organizes candidates into a confidence-guided prefix tree, where the first diffusion drafter generates a block along with per-position confidence scores that are used to identify the most likely rejection boundary and select the top-K prefix ranges for recovery; the second variable-prefix diffusion drafter re-anchors at each selected prefix and proposes alternative continuations in one batched pass; the resulting shared-prefix candidates are jointly verified via cascade attention. Empirically, D^2SD shows clear improvements over both the underlying diffusion approach and strong autoregressive speculative decoding baselines.
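The abstract does not say how confidence scores are turned into rejection boundaries, so the sketch below rests on a loudly stated assumption: each confidence is treated as an independent acceptance probability, and position i is scored by the chance that the first rejection happens exactly there. The function names and the scoring rule are hypothetical, not the paper's method.

```python
from typing import List


def first_rejection_probs(conf: List[float]) -> List[float]:
    """P(first rejection at i) = prod(conf[:i]) * (1 - conf[i]).

    Assumption (not from the paper): confidences behave like
    independent per-position acceptance probabilities.
    """
    probs = []
    survive = 1.0  # probability that all earlier positions were accepted
    for c in conf:
        probs.append(survive * (1.0 - c))
        survive *= c
    return probs


def top_k_boundaries(conf: List[float], k: int) -> List[int]:
    """Pick the k positions most likely to be the first rejection.

    A boundary i means: keep draft[:i] as the prefix and let a second
    drafter propose an alternative continuation starting at position i.
    """
    probs = first_rejection_probs(conf)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    confidences = [0.9, 0.5, 0.8, 0.4]
    print(top_k_boundaries(confidences, k=2))
```

Under this assumption, a low-confidence position late in the block can rank below an earlier one with moderate confidence, because reaching it requires every earlier token to be accepted first.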
Problem

Research questions and friction points this paper is trying to address.

speculative decoding
diffusion models
token acceptance rate
draft sequences
autoregressive inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
diffusion models
prefix tree
confidence-guided generation
cascade attention