Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing

📅 2025-08-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing open-source diffusion large language models (dLLMs) suffer from significantly slower inference than autoregressive (AR) models of comparable scale. Method: We propose Discrete Diffusion Forcing (D2F), a strategy that reformulates dLLMs into an AR-diffusion hybrid paradigm. D2F enables block-level autoregressive generation, KV cache reuse, and, crucially, cross-block parallel prediction for the first time. Integrated with asymmetric knowledge distillation and pipelined parallel decoding, D2F constructs an efficient inference architecture atop pre-trained dLLMs. Contribution/Results: On GSM8K, our method achieves 2.5× higher inference throughput than LLaMA3 and Qwen2.5, and over 50× speedup relative to prior dLLM baselines (e.g., LLaDA, Dream), while preserving competitive generation quality. This marks the first instance of an open-source dLLM surpassing AR models in both efficiency and capability in practical inference.

๐Ÿ“ Abstract
Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier with a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks, enabling inter-block parallel decoding. In this way, vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than $\mathbf{2.5\times}$ the inference speed of LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs such as LLaDA and Dream, the acceleration can exceed $\mathbf{50\times}$ while maintaining comparable output quality. The code is available at https://github.com/zhijie-group/Discrete-Diffusion-Forcing.
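The speedup from pipelined, inter-block parallel decoding can be illustrated with a toy scheduling model. This is a minimal sketch under stated assumptions (fixed denoising steps per block, a fixed lookahead before a successor block may start), not the authors' implementation; `pipelined_schedule` and its parameters are hypothetical names for illustration only.

```python
# Toy model of pipelined block decoding (illustrative, not the D2F code).
# Assumption: each block needs `steps` denoising iterations; block i may
# enter the pipeline once block i-1 has completed `lookahead` iterations,
# instead of waiting for all `steps` (the fully serial case).

def pipelined_schedule(num_blocks: int, steps: int, lookahead: int) -> int:
    """Total iterations to finish all blocks under the pipeline."""
    # Block i starts at iteration i * lookahead and finishes `steps` later.
    finish_times = [i * lookahead + steps for i in range(num_blocks)]
    return max(finish_times)

# Fully serial: a new block only starts after the previous one finishes.
serial = pipelined_schedule(num_blocks=8, steps=16, lookahead=16)   # 128
# Pipelined: a block starts after its predecessor's first 4 iterations.
pipelined = pipelined_schedule(num_blocks=8, steps=16, lookahead=4)  # 44
print(serial, pipelined)
```

Shrinking the lookahead trades quality for speed, which mirrors the efficiency/efficacy trade-off the abstract attributes to the pipelined parallel decoding algorithm.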
Problem

Research questions and friction points this paper is trying to address.

Accelerating diffusion LLM inference speed beyond autoregressive models
Enabling parallel decoding without prior block completion requirements
Achieving KV cache utilization in diffusion-based text generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete diffusion forcing for faster inference
Block-wise autoregressive generation with KV cache
Pipelined parallel decoding algorithm for efficiency
Authors
Xu Wang (Shanghai Jiao Tong University)
Chenkai Xu (Shanghai Jiao Tong University)
Yijie Jin (Shanghai Jiao Tong University)
Jiachun Jin (Shanghai Jiao Tong University)
Hao Zhang (University of California San Diego)
Zhijie Deng (Shanghai Jiao Tong University)