🤖 AI Summary
Traditional semi-autoregressive decoding uses a fixed block size, which causes two key issues: high-confidence tokens outside the current block suffer delayed generation (incurring latency overhead), while low-confidence tokens inside the block are committed prematurely (leading to error accumulation). This work is the first to systematically challenge the fixed-block assumption for diffusion-based large language models, proposing a semantics-aware adaptive block scheduling method. Our core contribution is the first identification of an intrinsic correlation between confidence fluctuations during the denoising process and local semantic structure; leveraging this insight, we design a training-free, confidence-driven mechanism that adapts block size at runtime. Integrated with KV caching and blockwise semi-autoregressive decoding, our approach adaptively aligns block boundaries with semantic steps. Experiments demonstrate that, under the same throughput budget, decoding accuracy improves by up to 5.3%, significantly enhancing the accuracy–latency trade-off.
📝 Abstract
Diffusion-based large language models (dLLMs) are gaining attention for their inherent capacity for parallel decoding, offering a compelling alternative to autoregressive LLMs. Among various decoding strategies, blockwise semi-autoregressive (semi-AR) approaches are widely adopted due to their natural support for KV caching and their favorable accuracy–speed trade-off. However, this paper identifies two fundamental limitations in the conventional semi-AR decoding approach that applies a fixed block size: i) late decoding overhead, where the unmasking of high-confidence tokens outside the current block is unnecessarily delayed, and ii) premature decoding error, where low-confidence tokens inside the current block are committed too early, leading to incorrect tokens. This paper presents the first systematic investigation challenging the fixed block size assumption in semi-AR decoding. Through a statistical analysis of confidence dynamics during the denoising process, we identify a volatility band (VB) region during dLLM decoding, which encodes local semantic structure and can be used to guide adaptive block sizing. Leveraging these insights, we introduce AdaBlock-dLLM, a training-free, plug-and-play scheduler that adaptively aligns block boundaries with semantic steps by adjusting block size at runtime. Extensive experiments across diverse benchmarks show that AdaBlock-dLLM achieves up to 5.3% accuracy improvement under the same throughput budget. Beyond inference-time optimization, we hope our semantics-aware adaptive scheduling approach and confidence-based analysis will inspire future training strategies for dLLMs.
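To make the scheduling idea concrete, below is a minimal PyTorch sketch of one way a runtime confidence-driven block-size rule could look. This is our illustration, not the paper's actual algorithm: the function name, the max-softmax confidence signal, and the drop threshold are all assumptions.

```python
import torch

@torch.no_grad()
def adaptive_block_size(conf: torch.Tensor,
                        max_block: int = 32,
                        drop_thresh: float = 0.15) -> int:
    """Pick the next semi-AR block size from per-position confidences.

    `conf[i]` is the model's confidence (e.g. max softmax probability)
    for the i-th still-masked position after the current frontier. The
    block grows while confidence stays smooth and is cut at the first
    sharp drop, which we treat as the edge of a local semantic step.
    The threshold and horizon are illustrative assumptions, not the
    paper's exact volatility-band rule.
    """
    horizon = min(max_block, conf.numel())
    for i in range(1, horizon):
        if conf[i - 1] - conf[i] > drop_thresh:
            return i  # cut the block just before the confidence cliff
    return horizon

# Demo on synthetic confidences: flat and high, then a sharp drop.
conf = torch.tensor([0.97, 0.95, 0.94, 0.93, 0.55, 0.50, 0.48])
print(adaptive_block_size(conf))  # -> 4: block ends before the drop
```

In a full semi-AR loop, such a scheduler would recompute `conf` from the denoiser's logits after each committed block, so block boundaries keep adapting as decoding progresses.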