MAGE: All-[MASK] Block Already Knows Where to Look in Diffusion LLM

📅 2026-02-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the memory bottleneck imposed by KV cache access in blockwise diffusion language models during long-context reasoning, a challenge that existing sparse attention methods struggle to mitigate effectively. The study makes the novel observation that the attention distribution from the first fully masked denoising step can accurately predict both critical KV positions and the required computational budget. Leveraging this insight, the authors propose a training-free dynamic sparse attention mechanism that uses a single precise computation to guide subsequent efficient sparse denoising steps, further enhanced by lightweight fine-tuning. Evaluated on benchmarks such as LongBench and Needle-in-a-Haystack, the method achieves near-lossless accuracy with extremely low KV budgets and delivers 3–4× end-to-end inference speedup, substantially outperforming sparse attention baselines designed for autoregressive models.

📝 Abstract
Block diffusion LLMs are emerging as a promising next paradigm for language generation, but their use of KV caching makes memory access a dominant bottleneck in long-context settings. While dynamic sparse attention has been actively explored, existing methods designed for autoregressive LLMs rely on approximate importance estimation and perform poorly when adapted to block diffusion. This work identifies a key opportunity unique to block diffusion: attention at the first All-[MASK] denoising step reliably predicts important KV entries and budget requirements, enabling MAGE to perform a single exact attention pass per block and reuse it for training-free sparse denoising. Across long-context benchmarks including LongBench and Needle-in-a-Haystack, MAGE achieves near-lossless accuracy with a fraction of the KV budget while delivering up to 3–4× end-to-end speedup, consistently outperforming AR-oriented sparse attention baselines. A lightweight fine-tuning strategy further strengthens [MASK]-guided patterns with minimal cost, requiring only a few hours of training on a single NVIDIA H100 GPU for both 1.5B and 7B models.
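The core idea in the abstract — run one exact attention pass at the All-[MASK] step, then keep only the KV positions that matter — can be sketched as follows. This is a minimal, framework-free illustration, not the paper's implementation: the softmax-mass thresholding rule, the `coverage` parameter, and all function names are assumptions; MAGE's actual budget-prediction rule and per-head handling are not specified here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def select_kv_budget(probs, coverage=0.95):
    """Keep the smallest set of KV positions whose attention mass
    reaches `coverage` (hypothetical rule standing in for the
    paper's budget prediction)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= coverage:
            break
    return sorted(keep)

def mage_block_sketch(q_mask, keys, coverage=0.95):
    """One exact attention pass with the all-[MASK] query; the
    returned index set would then be reused for the block's
    remaining (sparse) denoising steps."""
    d = len(q_mask)
    logits = [sum(qi * ki for qi, ki in zip(q_mask, k)) / math.sqrt(d)
              for k in keys]
    probs = softmax(logits)
    return select_kv_budget(probs, coverage)
```

The sketch also shows why the budget is dynamic: a query whose attention mass is concentrated on one key yields a tiny index set, while a flatter distribution keeps more entries before reaching the coverage threshold.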
Problem

Research questions and friction points this paper is trying to address.

block diffusion LLMs
KV caching
long-context
sparse attention
memory bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

block diffusion
sparse attention
KV caching
All-[MASK] denoising
training-free acceleration
Omin Kwon
Department of Computer Science and Engineering, Seoul National University, Seoul, South Korea
Yeonjae Kim
Department of Computer Science and Engineering, Seoul National University, Seoul, South Korea
Doyeon Kim
Department of Computer Science and Engineering, Seoul National University, Seoul, South Korea
Minseo Kim
Department of Industrial Engineering, Yonsei University
Yeonhong Park
Seoul National University
Computer Architecture, Computer Systems, ML Systems
Jae W. Lee
Professor of Computer Science and Engineering, Seoul National University, Korea
Computer Architecture, Parallel Programming, Compilers, VLSI Design, Hardware Security