TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing diffusion-based speculative decoding methods rank draft-tree nodes by marginal probability, so verification can spend its budget on tokens that can never be accepted, which raises latency while yielding limited acceptance gains. This work proposes TAPS, a target-aware prefix selection mechanism. TAPS converts the marginal probabilities produced by the diffusion drafter into path-conditioned acceptance estimates and selects a compact prefix-closed subtree under a fixed verification budget, so that verification respects prefix conditioning and avoids work on unreachable descendants of rejected prefixes. By combining parallel draft generation, path-conditioned acceptance estimation, and subtree selection, TAPS achieves up to 7.9× lossless end-to-end speedup over vanilla autoregressive decoding across multiple datasets and models, outperforming DFlash and DDTree by 1.36× and 1.74×, respectively.

📝 Abstract

Using a diffusion model for parallel drafting is a promising approach to speculative decoding. By predicting tokens at multiple future positions in a single forward pass, diffusion drafters substantially reduce drafting latency. However, this shifts the bottleneck to verification: verifying a single sequence limits acceptance length, while verifying large draft trees incurs excessive target-model latency. We identify a key mismatch: existing diffusion-tree methods rank nodes by marginal probability, ignoring that verification is prefix-conditioned. As a result, they may verify unreachable descendants of rejected prefixes, increasing latency with limited acceptance gains. To address this, we propose TAPS, a target-aware prefix selection method that turns diffusion marginals into path-conditioned acceptance estimates. TAPS then selects a compact prefix-closed subtree under a fixed verification budget, improving the acceptance-cost tradeoff rather than simply expanding the draft tree. Experiments across diverse datasets and model families demonstrate that TAPS achieves up to 7.9× lossless end-to-end speedup over vanilla autoregressive decoding, outperforming state-of-the-art DFlash and DDTree by 1.36× and 1.74×, respectively. Our work is available at https://anonymous.4open.science/r/TAPS-EMNLP2026-53DD
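To make "prefix-closed subtree under a fixed verification budget" concrete, here is a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify how TAPS computes its target-aware estimates, so the sketch assumes a node's acceptance estimate is simply the product of the drafter's per-position marginals along its path, and it uses a plain best-first search. The function name `select_prefix_closed_subtree` and the toy marginals are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

Path = Tuple[int, ...]  # token ids from the first drafted position down to a node


def select_prefix_closed_subtree(
    marginals: List[Dict[int, float]],
    budget: int,
) -> List[Tuple[Path, float]]:
    """Pick at most `budget` tree nodes in best-first order.

    marginals[d] maps candidate token id -> drafter marginal at depth d.
    ASSUMPTION: a node's score is the product of marginals along its path,
    a stand-in for the chance that the whole prefix is accepted. The real
    TAPS estimator is target-aware and is not reproduced here.
    """
    selected: List[Tuple[Path, float]] = []
    if budget <= 0 or not marginals:
        return selected

    # heapq is a min-heap, so scores are negated to pop the best node first.
    heap = [(-p, (tok,)) for tok, p in marginals[0].items()]
    heapq.heapify(heap)

    while heap and len(selected) < budget:
        neg_score, path = heapq.heappop(heap)
        score = -neg_score
        selected.append((path, score))
        depth = len(path)
        if depth < len(marginals):
            # Children enter the heap only after their parent is selected,
            # so every selected node has all of its prefixes selected too.
            for tok, p in marginals[depth].items():
                heapq.heappush(heap, (-(score * p), path + (tok,)))
    return selected


# Toy marginals for three drafted positions (made-up numbers).
toy = [{1: 0.6, 2: 0.4}, {3: 0.9, 4: 0.1}, {5: 0.8, 6: 0.2}]
chosen = select_prefix_closed_subtree(toy, budget=4)
for path, score in chosen:
    print(path, round(score, 3))
```

Because a child's score is its parent's score times a probability of at most 1, a child can never outrank its parent, and no verification slot goes to a node whose prefix was left out of the tree. Ranking nodes by their own marginal alone gives no such guarantee, which is the mismatch the abstract describes.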

Problem

Research questions and friction points this paper is trying to address.

speculative decoding

diffusion model

verification bottleneck

prefix-conditioned verification

draft tree

Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding

diffusion model

prefix tree selection