🤖 AI Summary
Traditional semi-autoregressive decoding uses a fixed block size, which causes two key issues: high-confidence tokens outside the current block suffer delayed generation (incurring latency overhead), while low-confidence tokens inside the block are committed prematurely (leading to error accumulation). This work is the first to systematically challenge the fixed-block assumption for diffusion-based large language models, proposing a semantics-aware adaptive block scheduling method. Our core contribution is the first identification of an intrinsic correlation between confidence fluctuations during the denoising process and local semantic structure; leveraging this insight, we design a training-free, confidence-driven mechanism that adapts block size at runtime. Integrated with KV caching and blockwise semi-autoregressive decoding, our approach adaptively aligns block boundaries with semantic steps. Experiments demonstrate that, under the same throughput budget, decoding accuracy improves by up to 5.3%, significantly enhancing the accuracy–latency trade-off.
📝 Abstract
Diffusion-based large language models (dLLMs) are gaining attention for their inherent capacity for parallel decoding, offering a compelling alternative to autoregressive LLMs. Among various decoding strategies, blockwise semi-autoregressive (semi-AR) approaches are widely adopted due to their natural support for KV caching and their favorable accuracy–speed trade-off. However, this paper identifies two fundamental limitations in the conventional semi-AR decoding approach that applies a fixed block size: i) late decoding overhead, where the unmasking of high-confidence tokens outside the current block is unnecessarily delayed, and ii) premature decoding error, where low-confidence tokens inside the current block are committed too early, leading to incorrect tokens. This paper presents the first systematic investigation challenging the fixed block size assumption in semi-AR decoding. Through a statistical analysis of confidence dynamics during the denoising process, we identify a volatility band (VB) region during dLLM decoding, which encodes local semantic structure and can be used to guide adaptive block sizing. Leveraging these insights, we introduce AdaBlock-dLLM, a training-free, plug-and-play scheduler that adaptively aligns block boundaries with semantic steps by adjusting block size at runtime. Extensive experiments across diverse benchmarks show that AdaBlock-dLLM achieves up to 5.3% accuracy improvement under the same throughput budget. Beyond inference-time optimization, we hope our semantics-aware adaptive scheduling approach and confidence-based analysis will inspire future training strategies for dLLMs.
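To make the scheduling idea concrete, below is a minimal PyTorch sketch of one way a runtime confidence-driven block-size rule could look. This is our illustration, not the paper's actual algorithm: the function name, the max-softmax confidence signal, and the drop threshold are all assumptions.

```python
import torch

@torch.no_grad()
def adaptive_block_size(conf: torch.Tensor,
                        max_block: int = 32,
                        drop_thresh: float = 0.15) -> int:
    """Pick the next semi-AR block size from per-position confidences.

    `conf[i]` is the model's confidence (e.g. max softmax probability)
    for the i-th still-masked position after the current frontier. The
    block grows while confidence stays smooth and is cut at the first
    sharp drop, which we treat as the edge of a local semantic step.
    The threshold and horizon are illustrative assumptions, not the
    paper's exact volatility-band rule.
    """
    horizon = min(max_block, conf.numel())
    for i in range(1, horizon):
        if conf[i - 1] - conf[i] > drop_thresh:
            return i  # cut the block just before the confidence cliff
    return horizon

# Demo on synthetic confidences: flat and high, then a sharp drop.
conf = torch.tensor([0.97, 0.95, 0.94, 0.93, 0.55, 0.50, 0.48])
print(adaptive_block_size(conf))  # -> 4: block ends before the drop
```

In a full semi-AR loop, such a scheduler would recompute `conf` from the denoiser's logits after each committed block, so block boundaries keep adapting as decoding progresses.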