AI Summary
Diffusion language models (dLLMs) struggle to leverage efficient speculative decoding techniques common in autoregressive models due to their reliance on masked language modeling and bidirectional attention. This work proposes a plug-and-play masking strategy that constructs temporally coherent token-level contexts, enabling dLLMs to verify multiple draft tokens in parallel within a single forward pass, thus introducing token-level speculative decoding without any additional training. The approach preserves the inherent parallelism of dLLMs and seamlessly integrates with acceleration techniques such as KV caching and block-wise decoding. Experiments demonstrate up to a 7.46× improvement in decoding throughput on four benchmarks using SDAR-family models, while maintaining or slightly enhancing generation quality.
Abstract
Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) LLMs, offering faster inference through parallel or blockwise decoding. However, their masked language modeling formulation remains incompatible with standard token-level speculative decoding, one of the most effective acceleration techniques for AR models. In AR decoding, the causal mask preserves temporally valid token-level contexts, enabling a target model to verify multiple drafted tokens in a single forward pass. In contrast, dLLMs rely on mask tokens and bidirectional attention, causing the effective context to change across denoising steps and preventing direct token-level speculative verification. To bridge this gap, we propose SimSD, a simple but effective speculative decoding algorithm for diffusion language models. Its core is a plug-and-play masking strategy that equips dLLMs with temporally valid token-level contexts for speculative decoding. Our method explicitly introduces reference tokens from draft-model predictions and designs an attention mask that regulates their interaction with current-step tokens, allowing dLLMs to compute valid logits for drafted tokens in a single forward pass. This restores the key verification ability provided by causal masking in AR models while preserving the parallel decoding advantages of dLLMs. The proposed method is training-free and can be flexibly integrated with other acceleration techniques such as KV caching and blockwise decoding. Experiments on SDAR-family dLLMs across four benchmarks show that our method achieves up to 7.46× higher decoding throughput while maintaining or even improving average generation quality.
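To make "verifying multiple drafted tokens in a single forward pass" concrete, the following is a minimal sketch of the standard greedy acceptance rule used in token-level speculative decoding. It is not SimSD's implementation: the names `verify_draft_greedy` and `argmax` and the toy logits are hypothetical. It assumes the target model has already produced one valid logits row per drafted position (plus one for the position after the draft), whether through causal masking in an AR model or through a masking strategy such as the one described above in a dLLM.

```python
from typing import List, Sequence


def argmax(row: Sequence[float]) -> int:
    """Return the index of the largest logit in one row."""
    return max(range(len(row)), key=lambda i: row[i])


def verify_draft_greedy(
    draft_tokens: List[int],
    target_logits: List[Sequence[float]],
) -> List[int]:
    """Greedy verification of a drafted token sequence (illustrative only).

    Assumption: target_logits[i] holds the target model's logits for
    position i, conditioned on the prompt plus draft_tokens[:i], and all
    rows come from one forward pass. There must be len(draft_tokens) + 1
    rows; the last row scores the token that follows the full draft.
    """
    if len(target_logits) != len(draft_tokens) + 1:
        raise ValueError("expected one logits row per draft token plus one")

    accepted: List[int] = []
    for i, drafted in enumerate(draft_tokens):
        target_choice = argmax(target_logits[i])
        if target_choice != drafted:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(target_choice)
            return accepted
        accepted.append(drafted)

    # Every drafted token matched, so the extra row yields a bonus token.
    accepted.append(argmax(target_logits[-1]))
    return accepted


# Toy example with a vocabulary of 4 tokens and a draft of length 3.
toy_logits = [
    [0.1, 0.2, 2.0, 0.3],  # target prefers token 2 (matches draft)
    [0.1, 1.5, 0.2, 0.3],  # target prefers token 1 (matches draft)
    [3.0, 0.2, 0.1, 0.4],  # target prefers token 0 (draft said 3)
    [0.0, 0.0, 0.0, 1.0],  # unused here because of the mismatch above
]
result = verify_draft_greedy([2, 1, 3], toy_logits)
print(result)  # [2, 1, 0]
```

Each call emits at least one token, and up to `len(draft_tokens) + 1` tokens when the whole draft is accepted, which is where the throughput gain of speculative decoding comes from. The sketch only covers greedy decoding; sampling-based verification uses a probabilistic acceptance test instead of an exact match.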