Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models

📅 2026-06-09
🤖 AI Summary
This work targets the cost of long-context inference in diffusion language models (dLLMs), which re-encode the entire prefix at every denoising step, so recomputation grows quadratically with context length. The authors propose Prefilling-dLLM, a training-free prefill-decode disaggregation framework that splits the prefix into chunks, caches each chunk's key-value (KV) states once, and at decode time attends only to the top-K most relevant chunks, with additional token-level sparsity inside each chunk. Beginning-of-sequence tokens prepended to each chunk act as periodic attention anchors that counter the “lost-in-the-middle” phenomenon. A custom attention kernel parallelizes decoding over the non-contiguously cached chunk KV. The authors report that sparse prefilling can outperform dense attention, with state-of-the-art quality among dLLM acceleration methods on LongBench and InfiniteBench and speedups of 9.1–28.0× at context lengths from 8K to 32K.
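As a rough, back-of-the-envelope accounting of the complexity argument, the per-step attention cost can be written out as follows. The symbols are introduced here for illustration and are not the paper's notation: $P$ is the prefix length, $D$ the number of tokens being decoded, $K$ the number of selected chunks, and $m$ the number of tokens retained per selected chunk. Only attention score computations per denoising step are counted.

```latex
\begin{align*}
\text{dense re-encoding:} \quad
  & C_{\text{dense}} = O\big((P + D)^2\big) \\
\text{cached top-}K\text{ chunks:} \quad
  & C_{\text{sparse}} = O\big(D\,(K m + D)\big) = O\big(D^2 + D K m\big)
\end{align*}
```

If $K m$ is held fixed as $P$ grows, the per-step cost no longer depends on the full sequence length and the only quadratic term left is in the decode length. The one-time prefill cost and the cost of scoring chunks are left out of this sketch.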
📝 Abstract
Diffusion large language models (dLLMs) re-encode the entire prefix at every denoising step, causing recomputation that scales quadratically with context length and becomes prohibitive for long-context scenarios. We propose Prefilling-dLLM, a training-free prefill-decode disaggregation framework for dLLMs that partitions the prefix into N chunks, caches their KV representations once, and selects the top-K most relevant chunks with intra-chunk token sparsity for decoding, showing that sparse prefilling can outperform dense attention while reducing per-step complexity from quadratic in the full sequence length to quadratic only in the decode length. On LongBench and InfiniteBench, Prefilling-dLLM achieves state-of-the-art quality among dLLM acceleration methods, and an attention kernel that parallelizes decoding over the non-contiguously cached chunk KV yields 9.1–28.0× speedup at 8K–32K contexts. We further show that beginning-of-sequence tokens prepended to each chunk act as periodic attention anchors that eliminate the lost-in-the-middle phenomenon. Code is available at https://github.com/menik1126/Prefilling-dLLM.
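The chunk-cache-select pipeline described in the abstract can be illustrated with a minimal, single-head sketch in plain Python. This is not the authors' implementation. The function names are hypothetical, and the relevance score (mean decode query dotted with mean chunk key) is an assumption chosen for simplicity, since the abstract does not specify the scoring rule. Intra-chunk token sparsity, the per-chunk beginning-of-sequence anchors, and the custom kernel are omitted.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(rows):
    # Column-wise mean of a list of equal-length vectors.
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def cache_prefix_chunks(keys, values, num_chunks):
    """Split prefix K/V rows into contiguous chunks (at most num_chunks).

    Meant to run once, before any denoising step.
    """
    size = math.ceil(len(keys) / num_chunks)
    return [(keys[i:i + size], values[i:i + size])
            for i in range(0, len(keys), size)]

def select_top_k_chunks(queries, chunks, k):
    """Assumed scoring rule: mean query dotted with each chunk's mean key."""
    q_mean = mean_vector(queries)
    scores = [dot(q_mean, mean_vector(chunk_keys)) for chunk_keys, _ in chunks]
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # restore original chunk order

def sparse_decode_attention(queries, dec_keys, dec_values, chunks, k):
    """Attend from decode queries to the selected cached chunks plus decode K/V."""
    picked = select_top_k_chunks(queries, chunks, k)
    keys = [row for i in picked for row in chunks[i][0]] + dec_keys
    values = [row for i in picked for row in chunks[i][1]] + dec_values
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        logits = [dot(q, key) * scale for key in keys]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) / total
                        for j in range(len(values[0]))])
    return outputs, picked

# Toy example: 8 prefix tokens in 4 chunks; only chunk 2 aligns with the queries.
d = 4
prefix_keys = [[0.0] * d for _ in range(8)]
prefix_keys[4] = [1.0] * d
prefix_keys[5] = [1.0] * d
prefix_values = [[float(i)] * d for i in range(8)]
chunks = cache_prefix_chunks(prefix_keys, prefix_values, num_chunks=4)

queries = [[1.0] * d, [1.0] * d]
dec_keys = [[0.0] * d, [0.0] * d]
dec_values = [[0.0] * d, [0.0] * d]
out, picked = sparse_decode_attention(queries, dec_keys, dec_values, chunks, k=1)
print(picked)
```

In this sketch the cached chunks are built once and reused at every step, so each step only touches the selected chunk rows and the decode tokens rather than the whole prefix.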
Problem

Research questions and friction points this paper is trying to address.

diffusion language models
long-context inference
quadratic complexity
recomputation
prefilling
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion language models
prefill-decode disaggregation
sparse attention
KV caching
long-context inference