SimSD: Simple Speculative Decoding in Diffusion Language Models

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Diffusion language models (dLLMs) cannot directly use the token-level speculative decoding that is common in autoregressive models, because they rely on masked language modeling and bidirectional attention. This work proposes a plug-and-play masking strategy that constructs temporally valid token-level contexts, enabling a dLLM to verify multiple draft tokens in parallel within a single forward pass. This brings token-level speculative decoding to dLLMs without any additional training. The approach preserves the inherent parallelism of dLLMs and can be combined with acceleration techniques such as KV caching and blockwise decoding. Experiments with SDAR-family models on four benchmarks show up to 7.46× higher decoding throughput while maintaining or slightly improving generation quality.
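To make "verify multiple draft tokens in a single forward pass" concrete, here is a minimal sketch of the standard greedy acceptance rule used in autoregressive speculative decoding. It is an illustration only: the function name `verify_draft_greedy`, the list-of-lists logit format, and the optional bonus row are assumptions, and the summary does not say that SimSD uses exactly this acceptance rule.

```python
from typing import List, Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest logit (first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def verify_draft_greedy(
    target_logits: Sequence[Sequence[float]],
    draft_tokens: Sequence[int],
) -> List[int]:
    """Greedy verification of drafted tokens (hypothetical helper).

    target_logits[i] is assumed to be the target model's logits for the
    position of draft_tokens[i], conditioned on the prefix plus
    draft_tokens[:i]. All rows come from one forward pass. An optional
    extra row gives the logits for the position after the last draft.
    """
    accepted: List[int] = []
    for i, token in enumerate(draft_tokens):
        best = argmax(target_logits[i])
        if best != token:
            # First mismatch: keep the target's own token and stop.
            accepted.append(best)
            return accepted
        accepted.append(token)
    # Every draft token matched; use the bonus row if one was supplied.
    if len(target_logits) > len(draft_tokens):
        accepted.append(argmax(target_logits[len(draft_tokens)]))
    return accepted


# Vocabulary of size 3; the third draft token disagrees with the target.
logits = [[0.0, 5.0, 0.0], [0.0, 0.0, 5.0], [0.0, 9.0, 0.0]]
print(verify_draft_greedy(logits, [1, 2, 0]))  # [1, 2, 1]
```

The rule only works if each row of logits was computed from a context that contains the earlier draft tokens and nothing later, which is the property the summary calls a temporally valid context.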

📝 Abstract

Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) LLMs, offering faster inference through parallel or blockwise decoding. However, their masked language modeling formulation remains incompatible with standard token-level speculative decoding, one of the most effective acceleration techniques for AR models. In AR decoding, the causal mask preserves temporally valid token-level contexts, enabling a target model to verify multiple drafted tokens in a single forward pass. In contrast, dLLMs rely on mask tokens and bidirectional attention, so the effective context changes across denoising steps, which prevents direct token-level speculative verification. To bridge this gap, we propose SimSD, a simple but effective speculative decoding algorithm for diffusion language models. SimSD is built around a plug-and-play masking strategy that equips dLLMs with temporally valid token-level contexts for speculative decoding. Our method explicitly introduces reference tokens from draft-model predictions and designs an attention mask that regulates their interaction with current-step tokens, allowing dLLMs to compute valid logits for drafted tokens in a single forward pass. This restores the key verification ability provided by causal masking in AR models while preserving the parallel decoding advantages of dLLMs. The proposed method is training-free and can be flexibly integrated with other acceleration techniques such as KV caching and blockwise decoding. Experiments on SDAR-family dLLMs across four benchmarks show that our method achieves up to 7.46× higher decoding throughput while maintaining or even improving average generation quality.
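The contrast between causal and bidirectional attention described in the abstract can be sketched with small boolean masks. The first two functions show the standard causal and fully bidirectional patterns. The third, `draft_reference_mask`, is purely an assumption made for illustration: the abstract does not specify the layout of SimSD's mask, so this is one plausible reading in which the query at block position i sees only reference (draft) tokens at earlier positions. All names here are hypothetical.

```python
from typing import List

Mask = List[List[bool]]  # mask[i][j] is True when query i may attend to key j


def causal_mask(n: int) -> Mask:
    """AR-style mask: query i sees keys 0..i, so its context is a valid prefix."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> Mask:
    """dLLM-style mask: every query sees every key, including later positions."""
    return [[True] * n for _ in range(n)]


def draft_reference_mask(k: int) -> Mask:
    """ASSUMPTION, not the paper's stated design.

    Rows are the k current-step queries of a block; columns are k reference
    tokens taken from draft-model predictions. Query i may attend to
    reference j only when j < i, mimicking a left-to-right drafting order.
    """
    return [[j < i for j in range(k)] for i in range(k)]


def visible(mask: Mask, i: int) -> List[int]:
    """Key indices that query i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


print(visible(causal_mask(4), 1))           # [0, 1]
print(visible(bidirectional_mask(4), 1))    # [0, 1, 2, 3]
print(visible(draft_reference_mask(4), 2))  # [0, 1]
```

Under the causal mask, the logits at position 1 never depend on positions 2 and 3, so they stay valid whichever tokens are drafted there. Under the bidirectional mask they do depend on those positions, which is why a dLLM needs an additional masking scheme before drafted tokens can be verified in one pass.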

Problem

Research questions and friction points this paper is trying to address.

diffusion language models

speculative decoding

masked language modeling

token-level verification

parallel decoding

Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding

diffusion language models

attention masking