MAGE: All-[MASK] Block Already Knows Where to Look in Diffusion LLM

📅 2026-02-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the memory bottleneck imposed by KV cache access in blockwise diffusion language models during long-context reasoning, a challenge that existing sparse attention methods struggle to mitigate effectively. The study makes the novel observation that the attention distribution from the first fully masked denoising step can accurately predict both critical KV positions and the required computational budget. Leveraging this insight, the authors propose a training-free dynamic sparse attention mechanism that uses a single precise computation to guide subsequent efficient sparse denoising steps, further enhanced by lightweight fine-tuning. Evaluated on benchmarks such as LongBench and Needle-in-a-Haystack, the method achieves near-lossless accuracy with extremely low KV budgets and delivers 3–4× end-to-end inference speedup, substantially outperforming sparse attention baselines designed for autoregressive models.

📝 Abstract
Block diffusion LLMs are emerging as a promising next paradigm for language generation, but their use of KV caching makes memory access a dominant bottleneck in long-context settings. While dynamic sparse attention has been actively explored, existing methods designed for autoregressive LLMs rely on approximate importance estimation and perform poorly when adapted to block diffusion. This work identifies a key opportunity unique to block diffusion: attention at the first All-[MASK] denoising step reliably predicts important KV entries and budget requirements, enabling MAGE to perform a single exact attention pass per block and reuse it for training-free sparse denoising. Across long-context benchmarks including LongBench and Needle-in-a-Haystack, MAGE achieves near-lossless accuracy with a fraction of the KV budget while delivering up to 3–4× end-to-end speedup, consistently outperforming AR-oriented sparse attention baselines. A lightweight fine-tuning strategy further strengthens [MASK]-guided patterns with minimal cost, requiring only a few hours of training on a single NVIDIA H100 GPU for both 1.5B and 7B models.
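The core idea in the abstract — run one exact attention pass at the All-[MASK] step, then keep only the KV positions that matter — can be sketched as follows. This is a minimal, framework-free illustration, not the paper's implementation: the softmax-mass thresholding rule, the `coverage` parameter, and all function names are assumptions; MAGE's actual budget-prediction rule and per-head handling are not specified here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def select_kv_budget(probs, coverage=0.95):
    """Keep the smallest set of KV positions whose attention mass
    reaches `coverage` (hypothetical rule standing in for the
    paper's budget prediction)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= coverage:
            break
    return sorted(keep)

def mage_block_sketch(q_mask, keys, coverage=0.95):
    """One exact attention pass with the all-[MASK] query; the
    returned index set would then be reused for the block's
    remaining (sparse) denoising steps."""
    d = len(q_mask)
    logits = [sum(qi * ki for qi, ki in zip(q_mask, k)) / math.sqrt(d)
              for k in keys]
    probs = softmax(logits)
    return select_kv_budget(probs, coverage)
```

The sketch also shows why the budget is dynamic: a query whose attention mass is concentrated on one key yields a tiny index set, while a flatter distribution keeps more entries before reaching the coverage threshold.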
Problem

Research questions and friction points this paper is trying to address.

block diffusion LLMs
KV caching
long-context
sparse attention
memory bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

block diffusion
sparse attention
KV caching
All-[MASK] denoising
training-free acceleration
Omin Kwon
Department of Computer Science and Engineering, Seoul National University, Seoul, South Korea
Yeonjae Kim
Department of Computer Science and Engineering, Seoul National University, Seoul, South Korea
Doyeon Kim
Department of Computer Science and Engineering, Seoul National University, Seoul, South Korea
Minseo Kim
Department of Industrial Engineering, Yonsei University
Yeonhong Park
Seoul National University
Computer Architecture, Computer Systems, ML Systems
Jae W. Lee
Professor of Computer Science and Engineering, Seoul National University, Korea
Computer Architecture, Parallel Programming, Compilers, VLSI Design, Hardware Security