🤖 AI Summary
Existing diffusion-based speculative decoding commits to a single draft sequence and discards every prediction after the first token mismatch, which limits its acceptance rate. To address this, the authors propose a dual-diffusion draft framework. In the first stage, a diffusion model generates a draft block annotated with per-position confidence scores, which are used to identify likely rejection boundaries and extract the prefixes worth recovering from. In the second stage, a variable-prefix diffusion model produces multiple alternative continuations in parallel, and the resulting candidates are jointly verified with cascade attention. By combining a confidence-guided prefix tree with this two-stage diffusion process, the method is reported to improve acceptance rate, inference speedup, and throughput over state-of-the-art speculative decoding approaches across multiple benchmarks.
📝 Abstract
Speculative decoding accelerates autoregressive large language model inference by drafting multiple tokens and verifying them in a single target-model forward pass. Recent diffusion-based drafters generate an entire block of tokens in parallel but usually commit to a single draft sequence per verification: once the first mismatch occurs, all subsequent draft tokens are discarded, which limits the acceptance rate. Naively batching more draft candidate sequences yields only a marginal improvement, because redundant or poorly placed branches increase the cost of drafting and verification without proportionally increasing the number of accepted tokens.
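To make the single-draft limitation concrete, here is a minimal sketch of greedy verification, assuming the target model's per-position predictions are already available as a list. The function name `accepted_prefix_length` and the token IDs are hypothetical and chosen only for illustration; real systems may instead use a probabilistic acceptance rule.

```python
from typing import List


def accepted_prefix_length(draft: List[int], target: List[int]) -> int:
    """Count the draft tokens accepted under greedy verification.

    target[i] is assumed to be the token the target model predicts at
    position i given the context followed by draft[:i].
    """
    for i, (d, t) in enumerate(zip(draft, target)):
        if d != t:
            # First mismatch: this token and everything after it is dropped.
            return i
    return min(len(draft), len(target))


draft = [11, 42, 7, 99, 5]
target = [11, 42, 8, 99, 5]
n = accepted_prefix_length(draft, target)  # n == 2
```

Only two of the five drafted tokens survive here: a single early mismatch ends the block, so the drafting work spent on the later positions is lost.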
We propose D^2SD, a dual diffusion draft speculative decoding framework that organizes candidates into a confidence-guided prefix tree. The first diffusion drafter generates a block along with per-position confidence scores, which are used to identify the most likely rejection boundary and select the top-K prefix ranges for recovery. The second, variable-prefix diffusion drafter re-anchors at each selected prefix and proposes alternative continuations in one batched pass. The resulting shared-prefix candidates are then jointly verified via cascade attention. Empirically, D^2SD shows clear improvements over both the underlying diffusion approach and strong autoregressive speculative decoding baselines.
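The abstract does not say how confidence scores are turned into prefix choices, so the following is only one plausible sketch, not the authors' method. It assumes each confidence score can be read as an independent per-token acceptance probability, scores each position by the probability that the first rejection happens there, and keeps the K most likely positions as re-anchoring points. The names `first_rejection_probs` and `select_prefix_lengths` are hypothetical.

```python
from typing import List


def first_rejection_probs(conf: List[float]) -> List[float]:
    """P(first rejection at position i), assuming independent acceptance.

    Equals the product of conf[j] for j < i, times (1 - conf[i]).
    """
    probs = []
    survive = 1.0  # probability that all earlier tokens were accepted
    for c in conf:
        probs.append(survive * (1.0 - c))
        survive *= c
    return probs


def select_prefix_lengths(conf: List[float], k: int) -> List[int]:
    """Return the k prefix lengths whose next token is most likely rejected.

    A returned value i means: keep draft[:i] and redraft from position i.
    """
    probs = first_rejection_probs(conf)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[:k])


confidences = [0.9, 0.8, 0.5, 0.9, 0.4]  # illustrative values only
prefixes = select_prefix_lengths(confidences, k=2)  # [2, 4]
```

Under this heuristic, a second drafter would be re-anchored after the first 2 and the first 4 draft tokens. Because the alternative continuations share those prefixes, the candidates form a tree rather than independent sequences, and shared-prefix verification exploits exactly that structure.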