DSL-LLaDA: Scaling Continuous Denoising to 8B Masked Diffusion LMs

📅 2026-05-31

🤖 AI Summary

This work addresses the trade-off between generation length and quality in few-step decoding with discrete masked diffusion language models. Starting from LLaDA-8B-Instruct, the authors continue pretraining for only 1,000 steps with Discrete Stochastic Localization (DSL), replacing binary masks with continuous per-token Gaussian noise that acts as a soft mask. The adapted model supports continuous inference that evolves all positions jointly in embedding space and defers hard token commitment to the final step. On zero-shot summarization with ≤16 forward passes, DSL-LLaDA-SDE achieves the best ROUGE-1 on all four benchmarks and largely avoids the premature-termination and repetition failures of iterative unmasking. The adaptation also yields selective noisy-state robustness: the model corrects corrupted tokens while preserving clean ones. Control experiments using standard masked diffusion training with the same compute show neither behavior.

📝 Abstract

Discrete masked diffusion language models generate text by iterative parallel decoding, but few-step decoding suffers from a tradeoff between length and quality: with a fixed step budget, standard methods can generate a short, high-quality output, or they can produce long but repetitive text. Continuous denoising can sidestep this tradeoff by evolving all positions jointly in embedding space, but building such a model from scratch at scale remains an open problem. We show that a pretrained masked DLM can instead be lightly adapted to support continuous embedding-space denoising. Starting from LLaDA-8B-Instruct, we continue-pretrain for only 1,000 steps with Discrete Stochastic Localization (DSL), replacing binary masking with continuous per-token Gaussian noise as a soft mask. The adapted model supports continuous inference that evolves all positions jointly in embedding space and defers hard token commitment to the final step. On zero-shot summarization at low step budgets (<=16 forward passes), DSL-LLaDA-SDE achieves the best ROUGE-1 on all four benchmarks and largely avoids the premature-termination / repetition tradeoff of iterative unmasking. The same adaptation also yields selective noisy-state robustness: the model corrects corrupted tokens while preserving clean ones. Control experiments using standard masked diffusion training with the same compute demonstrate neither behavior.
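
To make the contrast between binary masking and a Gaussian soft mask concrete, the toy NumPy sketch below may help. It is not the paper's implementation: the linear interpolation schedule, the nearest-embedding commitment rule, and every function name (`binary_mask`, `gaussian_soft_mask`, `commit_tokens`) are assumptions made purely for illustration, since the abstract does not specify DSL's exact parametrization.

```python
import numpy as np


def binary_mask(embeddings, mask_embedding, mask_prob, rng):
    """Standard masked-diffusion corruption: each token is either kept
    intact or replaced wholesale by a single [MASK] embedding."""
    seq_len = embeddings.shape[0]
    is_masked = rng.random(seq_len) < mask_prob
    corrupted = np.where(is_masked[:, None], mask_embedding[None, :], embeddings)
    return corrupted, is_masked


def gaussian_soft_mask(embeddings, noise_levels, rng):
    """Soft mask (assumed form, not taken from the paper): blend each
    token's embedding with Gaussian noise at its own noise level.
    noise_levels[i] = 0 keeps token i clean; 1 replaces it with pure noise."""
    t = np.asarray(noise_levels, dtype=embeddings.dtype)[:, None]
    eps = rng.standard_normal(embeddings.shape).astype(embeddings.dtype)
    return (1.0 - t) * embeddings + t * eps


def commit_tokens(final_embeddings, embedding_table):
    """Hard commitment, done once at the final step: map each position to
    the vocabulary entry with the smallest squared Euclidean distance
    (a hypothetical decoding rule chosen for simplicity)."""
    diffs = final_embeddings[:, None, :] - embedding_table[None, :, :]
    return (diffs ** 2).sum(axis=-1).argmin(axis=1)
```

For instance, calling `gaussian_soft_mask(emb, [0.0, 0.5, 1.0], rng)` leaves the first token untouched, partially corrupts the second, and fully replaces the third with noise. A binary mask cannot express the middle case, which is the sense in which per-token noise acts as a soft mask.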

Problem

Research questions and friction points this paper is trying to address.

masked diffusion language models

continuous denoising

length-quality tradeoff

iterative decoding

embedding-space evolution

Innovation

Methods, ideas, or system contributions that make the work stand out.

continuous denoising

masked diffusion language model

Discrete Stochastic Localization