AI Summary
This work addresses error propagation in the parallel decoding of masked diffusion language models, which arises when positions that are individually high-confidence have coupled predictions. Existing training-free samplers overlook this problem because they rank candidates by token-wise scores and ignore interactions among the selected tokens. The authors propose ADAS, a training-free reranking mechanism that preserves the base sampler's stopping rule and changes only how the revealed subset is built. ADAS keeps attention continuous and uses it as a soft marginal penalty rather than a hard compatibility constraint: it greedily discounts candidate tokens that attend strongly to already-selected positions whose predictions remain uncertain. It plugs into training-free samplers such as Top-k, Fast-dLLM, and EB-Sampler. Experiments on GSM8K, MATH500, HumanEval, and MBPP show average improvements of 9.11 and 10.46 percentage points under low-NFE settings on LLaDA-8B-Base and Dream-7B-Base, respectively, with only 3.1% per-forward runtime overhead.
Abstract
Masked diffusion language models can reduce inference steps by revealing multiple tokens per denoising iteration, but this parallelism is fragile: positions that are individually confident may be unsafe to commit together when their predictions are coupled. Existing training-free samplers such as Top-\(k\), Fast-dLLM, and EB-Sampler mainly control how many tokens to reveal, while often ranking candidates by token-wise scores that ignore interactions within the selected set. We propose ADAS, a training-free reranking rule for parallel masked diffusion decoding. ADAS leaves the base sampler's stopping rule unchanged and modifies only subset construction: it greedily discounts a candidate when it attends strongly to already selected positions whose predictions remain uncertain. Unlike graph-constrained methods that turn attention into hard compatibility constraints, ADAS keeps attention continuous and uses it as a soft marginal penalty. Across LLaDA-8B-Base and Dream-7B-Base on GSM8K, MATH500, HumanEval, and MBPP, plugging ADAS into Top-\(k\), Fast-dLLM, and EB-Sampler improves low-NFE performance at matched denoiser evaluations by \(9.11\) and \(10.46\) percentage points on average, respectively, with \(3.1\%\) per-forward runtime overhead. These results show that soft attention-discounted reranking is a simple and modular way to improve quality in highly parallel decoding for masked diffusion language models.
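The greedy discounting described above can be illustrated with a minimal sketch. The abstract does not give the exact scoring formula, so everything here is an assumption made for illustration: the function name `adas_rerank`, the use of `1 - conf[j]` as the uncertainty of a selected position, the linear penalty, the weight `lam`, and a fixed budget `k` standing in for the base sampler's stopping rule.

```python
def adas_rerank(conf, attn, k, lam=1.0):
    """Greedily pick up to k positions, discounting coupled candidates.

    conf[i]    : model confidence for masked position i (assumed in [0, 1]).
    attn[i][j] : attention weight from position i to position j.
    k          : number of tokens to reveal; stands in for the base
                 sampler's stopping rule, which is left unchanged.
    lam        : hypothetical penalty weight, not taken from the paper.
    """
    remaining = set(range(len(conf)))
    selected = []
    while remaining and len(selected) < k:
        def discounted(i):
            # Soft marginal penalty: attention to already-selected positions,
            # weighted by how uncertain those positions still are.
            penalty = sum(attn[i][j] * (1.0 - conf[j]) for j in selected)
            return conf[i] - lam * penalty

        # Iterating in sorted order makes ties resolve to the lowest index.
        best = max(sorted(remaining), key=discounted)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy case: position 1 is confident but attends strongly to position 0.
conf = [0.90, 0.88, 0.80]
attn = [
    [0.00, 0.00, 0.00],
    [0.90, 0.00, 0.00],
    [0.05, 0.00, 0.00],
]
print(adas_rerank(conf, attn, k=2, lam=0.0))  # plain confidence ranking: [0, 1]
print(adas_rerank(conf, attn, k=2, lam=1.0))  # with the soft discount: [0, 2]
```

In this toy case the discount lowers position 1 from 0.88 to about 0.79, just below position 2 at about 0.795, so the coupled pair is not revealed together. Because the penalty is continuous, a strongly attending candidate can still be selected when its confidence is high enough, which is the difference from a hard compatibility constraint.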