Accelerating Speculative Decoding with Block Diffusion Draft Trees

📅 2026-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes DDTree (Diffusion Draft Tree), a speculative decoding method that addresses a limitation of vanilla DFlash: verifying only a single drafted trajectory per round, which can cap the accepted sequence length and therefore the speedup. DDTree combines a block diffusion drafter with tree-structured verification. It builds a draft tree from the drafter's per-position distributions, using a best-first heap search under a fixed node budget to select the continuations most likely to match the target model, and then verifies the entire tree in a single forward pass of the target model. An ancestor-only attention mask makes this single-pass verification possible. According to the authors, the resulting gains, built on top of DFlash, place DDTree among the leading approaches to speculative decoding.
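
As a rough illustration of the ancestor attention mask mentioned above, the sketch below builds a boolean mask from a list of parent indices. The tree encoding (`parents[i]` is the index of node `i`'s parent, with `-1` for nodes attached directly to the already-accepted prefix), the function name `ancestor_mask`, and the choice to let each node also attend to itself are assumptions made for this example, not details taken from the paper. Attention to the shared prefix is assumed to be handled separately.

```python
def ancestor_mask(parents):
    """Return mask[i][j] == True when draft node i may attend to draft node j.

    Assumption: parents[i] is the index of node i's parent in the draft tree,
    or -1 when node i hangs directly off the verified prefix. A node attends
    to itself and to every node on the path back to the prefix, and to
    nothing else, so sibling branches never see each other.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        # Walk from node i up to the prefix, marking each node on the path.
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask
```

For a four-node tree with `parents = [-1, 0, -1, 2]`, node 1 can see nodes 0 and 1, while node 3 can see nodes 2 and 3. Because the two branches are masked from each other, both can share one batched forward pass without contaminating each other's context.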

📝 Abstract
Speculative decoding accelerates autoregressive language models by using a lightweight drafter to propose multiple future tokens, which the target model then verifies in parallel. DFlash shows that a block diffusion drafter can generate an entire draft block in a single forward pass and achieve state-of-the-art speculative decoding performance, outperforming strong autoregressive drafters such as EAGLE-3. Vanilla DFlash, however, still verifies only a single drafted trajectory per round, potentially limiting its acceptance length. We introduce DDTree (Diffusion Draft Tree), a method that constructs a draft tree directly from the per-position distributions of a block diffusion drafter. Under a fixed node budget, DDTree uses a simple best-first heap algorithm to select the continuations that are most likely to match the target model according to a surrogate defined by the draft model's output. The resulting tree is verified efficiently in a single target model forward pass using an ancestor-only attention mask. Because DDTree builds on DFlash, a leading draft model for speculative decoding, these gains place DDTree among the leading approaches to speculative decoding.
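
The best-first selection described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the surrogate score of a path is taken to be the product of the drafter's per-position probabilities, candidates are expanded lazily (each popped node contributes its next-best sibling and its best child), and the names `build_draft_tree`, `position_probs`, and `node_budget` are hypothetical.

```python
import heapq
import math


def build_draft_tree(position_probs, node_budget):
    """Pick up to node_budget tree nodes in descending surrogate score.

    position_probs: one dict per draft position mapping token -> probability
    (assumed to come from a single block diffusion drafter pass).
    Returns a list of (parent_index, token, log_score); parent_index is -1
    for nodes attached directly to the verified prefix.
    """
    # Rank the candidate tokens at every depth, most probable first.
    ranked = [
        sorted(((t, p) for t, p in dist.items() if p > 0), key=lambda tp: -tp[1])
        for dist in position_probs
    ]
    nodes = []
    if node_budget <= 0 or not ranked or not ranked[0]:
        return nodes

    tie = 0  # unique counter so heap entries never compare beyond it
    # Entry: (-log_score, tie, parent_index, parent_log_score, depth, rank)
    heap = [(-math.log(ranked[0][0][1]), tie, -1, 0.0, 0, 0)]
    while heap and len(nodes) < node_budget:
        neg, _, parent, parent_log, depth, rank = heapq.heappop(heap)
        log_score = -neg
        nodes.append((parent, ranked[depth][rank][0], log_score))
        index = len(nodes) - 1

        # Next-best sibling: same parent, next token in the ranking.
        if rank + 1 < len(ranked[depth]):
            tie += 1
            sibling_log = parent_log + math.log(ranked[depth][rank + 1][1])
            heapq.heappush(heap, (-sibling_log, tie, parent, parent_log, depth, rank + 1))

        # Best child: extend this path with the top token at the next depth.
        if depth + 1 < len(ranked) and ranked[depth + 1]:
            tie += 1
            child_log = log_score + math.log(ranked[depth + 1][0][1])
            heapq.heappush(heap, (-child_log, tie, index, log_score, depth + 1, 0))
    return nodes
```

Because a child or sibling never scores higher than the node that introduced it, nodes are popped in non-increasing score order, and every parent is selected before its children. The heap therefore holds at most a couple of entries per selected node rather than every possible continuation.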

Problem

Research questions and friction points this paper is trying to address.

speculative decoding
draft tree
acceptance length
block diffusion
autoregressive language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
block diffusion
draft tree
parallel verification
autoregressive language models