🤖 AI Summary
Existing block decoding approaches for diffusion language models rely on fixed block sizes or delimiter-based runtime signals, which often misalign with semantic boundaries and can limit generation quality and efficiency. This work proposes SemBlock, a framework that casts dynamic block construction as semantic boundary prediction. SemBlock trains a lightweight boundary predictor on frozen LLaDA hidden states and uses the predicted boundary probabilities to choose where each block ends during decoding. For supervision, the authors construct SemBound, a semantic-boundary dataset whose labels are derived from discourse units, reasoning steps, and implementation spans across natural language, math, and code tasks. Experiments on GSM8K, IFEval, MATH, and HumanEval show that SemBlock consistently improves over both fixed-block decoding and AdaBlock.
📝 Abstract
Diffusion language models (DLMs) generate text through iterative denoising, and blockwise decoding improves their practicality by committing tokens in local blocks. However, existing blockwise methods typically rely on fixed block sizes or delimiter-based runtime signals, which do not necessarily align with semantic boundaries. In this paper, we propose SemBlock, a semantic-boundary-driven dynamic block decoding framework for diffusion LLMs. SemBlock formulates dynamic block construction as semantic boundary prediction and trains lightweight predictors on frozen LLaDA hidden states. To provide supervision, we construct SemBound, a semantic-boundary dataset that derives boundary labels from discourse units, reasoning steps, and implementation spans across natural language, math, and code tasks. During inference, SemBlock uses predicted boundary probabilities to select the ending position of each dynamic block. Experiments on GSM8K, IFEval, MATH, and HumanEval show that SemBlock consistently improves over fixed-block decoding and AdaBlock. Our code is publicly available: https://github.com/TH-AI-Lab-PKU/SemBlock.
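The inference step described above, choosing each block's ending position from predicted boundary probabilities, can be illustrated with a minimal sketch. The abstract does not state the actual selection rule, so everything here is an assumption made for illustration: the function names `select_block_end` and `split_into_blocks`, the `min_len`, `max_len`, and `threshold` parameters, and the rule itself (take the earliest position in the allowed window whose probability clears the threshold, otherwise fall back to the most probable position in that window). The sketch also treats the probabilities as a fixed list, whereas a real decoder would presumably recompute them from the hidden states after each block is committed.

```python
from typing import List, Sequence, Tuple


def select_block_end(
    boundary_probs: Sequence[float],
    start: int,
    min_len: int = 4,        # hypothetical lower bound on block length
    max_len: int = 32,       # hypothetical upper bound on block length
    threshold: float = 0.5,  # hypothetical boundary confidence cutoff
) -> int:
    """Return the inclusive index of the last token of the block starting at `start`."""
    n = len(boundary_probs)
    if not 0 <= start < n:
        raise ValueError("start must index into boundary_probs")
    if min_len < 1 or max_len < min_len:
        raise ValueError("require 1 <= min_len <= max_len")

    # Candidate end positions, clipped to the end of the sequence.
    lo = min(start + min_len - 1, n - 1)
    hi = min(start + max_len - 1, n - 1)

    # Assumed rule: the earliest candidate that looks like a boundary wins.
    for i in range(lo, hi + 1):
        if boundary_probs[i] >= threshold:
            return i

    # Fallback: no candidate clears the threshold, so take the most likely one.
    return max(range(lo, hi + 1), key=lambda i: boundary_probs[i])


def split_into_blocks(
    boundary_probs: Sequence[float], **kwargs
) -> List[Tuple[int, int]]:
    """Partition positions into consecutive (start, end) blocks, both inclusive."""
    blocks = []
    start = 0
    while start < len(boundary_probs):
        end = select_block_end(boundary_probs, start, **kwargs)
        blocks.append((start, end))
        start = end + 1
    return blocks


if __name__ == "__main__":
    probs = [0.1, 0.2, 0.9, 0.1, 0.1, 0.8, 0.3, 0.2]
    print(split_into_blocks(probs, min_len=2, max_len=4, threshold=0.5))
```

With these toy probabilities, the blocks end at the two high-probability positions (indices 2 and 5), and the final block falls back to the last remaining position. Unlike a fixed-size scheme, block lengths here vary with where the predictor places boundaries, which is the behavior the abstract contrasts with fixed-block decoding.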