🤖 AI Summary
Existing diffusion language models rely primarily on local, token-level information during sampling. They neglect global sequence structure and therefore struggle to balance generation quality against parallel efficiency. This work is the first to formulate sampling-order selection as an NP-hard optimization problem, and it introduces Attn-Sampler, a training-free algorithm built on a computationally tractable approximation: decode tokens in descending order of the column sums of the attention matrix. By combining attention analysis, an approximation of the sampling rank, and dynamic thresholding for acceleration, the method outperforms baselines such as greedy search across multiple benchmarks. It improves both text generation quality and parallelizability, offering a theoretically grounded and practical framework for attention-guided sampling in diffusion-based language models.
📝 Abstract
Auto-regressive models (ARMs) have established a dominant paradigm in language modeling. However, their strictly sequential decoding imposes fundamental constraints on both inference efficiency and modeling flexibility. To address these limitations, diffusion-based large language models (dLLMs) have been proposed, offering the potential for parallel decoding and flexible language modeling. Despite these advantages, current dLLM decoding strategies rely primarily on token-level information, which fails to account for global sequence structure and often yields suboptimal results. In this paper, we study the decoding order selection problem from the perspective of log-likelihood maximization. We show theoretically that the optimal sequence likelihood can be approximately achieved by decoding tokens in descending order of their attention matrix column sums. This finding provides a principled justification for attention-guided decoding and a theoretically grounded alternative to greedy search. We instantiate this insight in a new training-free decoding algorithm, termed Attn-Sampler, and further propose a block attention approximation and dynamic attention thresholding for practical acceleration. Extensive experiments across multiple benchmarks validate the effectiveness of the proposed method, demonstrating that it achieves superior generation quality while enhancing decoding parallelism.
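The ordering rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a square attention matrix whose rows are queries and whose columns are keys, so that a column sum measures how much attention a position receives. The function names (`column_sums`, `decoding_order`, `select_parallel`) and the `ratio` threshold rule are hypothetical: the abstract names dynamic attention thresholding but does not specify its form, and the block attention approximation is not modeled here.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def column_sums(attn: Matrix) -> List[float]:
    """Sum each column of a square attention matrix (rows = queries, columns = keys)."""
    n = len(attn)
    return [sum(attn[i][j] for i in range(n)) for j in range(n)]


def decoding_order(attn: Matrix, masked: Sequence[int]) -> List[int]:
    """Sort still-masked positions by descending column sum (ties: lower index first)."""
    sums = column_sums(attn)
    return sorted(masked, key=lambda j: (-sums[j], j))


def select_parallel(attn: Matrix, masked: Sequence[int], ratio: float = 0.9) -> List[int]:
    """Assumed thresholding rule (not from the source): decode in one step every
    masked position whose column sum is at least `ratio` times the largest
    column sum among the masked positions."""
    if not masked:
        return []
    sums = column_sums(attn)
    top = max(sums[j] for j in masked)
    return [j for j in decoding_order(attn, masked) if sums[j] >= ratio * top]


# Toy 4x4 attention matrix; position 1 is already decoded, the rest are masked.
attn = [
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.5, 0.3, 0.1],
    [0.2, 0.4, 0.3, 0.1],
    [0.1, 0.5, 0.3, 0.1],
]
masked = [0, 2, 3]

print(decoding_order(attn, masked))        # positions ranked by attention received
print(select_parallel(attn, masked, 0.4))  # positions decoded together in this step
```

In this toy example, lowering `ratio` admits more positions per step (more parallelism), while raising it toward 1.0 approaches decoding one position at a time in ranked order.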