LLaDA2.0: Scaling Up Diffusion Language Models to 100B

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inference inefficiency of autoregressive decoding in large language models (LLMs). The authors propose a paradigm that systematically transforms pretrained autoregressive LLMs into discrete diffusion LLMs (dLLMs) at the 100B-parameter scale. Methodologically, they introduce a three-stage block-level Warmup-Stable-Decay (WSD) training schedule, integrated with a sparse Mixture-of-Experts (MoE) architecture and discrete token-level diffusion modeling, alongside supervised fine-tuning (SFT) and Direct Preference Optimization (DPO). To the authors' knowledge, this is the first demonstration of a MoE-based diffusion LLM's feasibility and practicality at the 100B scale. Two instruction-tuned models are open-sourced: LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), which retain parallel decoding advantages while achieving substantial gains in inference speed and task performance. The approach establishes a viable pathway for deploying ultra-large-scale dLLMs.
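The parallel-decoding advantage mentioned above can be illustrated with a minimal confidence-based decoding step: predict all masked positions at once, then commit only the most confident predictions and leave the rest masked for later iterations. This is a generic dLLM decoding pattern sketched for illustration, not the paper's exact remasking rule; the function name, `commit_frac`, and all values are assumptions.

```python
import numpy as np

def parallel_decode_step(logits, tokens, mask_id, commit_frac=0.25):
    """One illustrative confidence-based parallel decoding step.

    Predicts every masked position simultaneously, then commits only
    the most confident fraction; remaining positions stay masked.
    (A generic dLLM decoding sketch, not the paper's exact rule.)
    """
    # softmax over the vocabulary dimension
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    preds = probs.argmax(-1)          # most likely token per position
    conf = probs.max(-1)              # confidence of that prediction
    masked = np.where(tokens == mask_id)[0]
    if masked.size == 0:
        return tokens                 # nothing left to decode
    k = max(1, int(len(masked) * commit_frac))
    # commit the k most confident masked positions this step
    commit = masked[np.argsort(-conf[masked])[:k]]
    out = tokens.copy()
    out[commit] = preds[commit]
    return out
```

Iterating this step until no mask tokens remain fills the sequence in far fewer forward passes than token-by-token autoregressive decoding, which is the source of the inference-speed gains described above.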

📝 Abstract
This paper presents LLaDA2.0 -- a family of discrete diffusion large language models (dLLMs) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models -- establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds the principles of knowledge inheritance, progressive adaptation, and efficiency-aware design, and seamlessly converts a pre-trained AR model into a dLLM with a novel 3-phase block-level WSD-based training scheme: progressively increasing the block size in block diffusion (warm-up), large-scale full-sequence diffusion (stable), and reverting to compact-size block diffusion (decay). Together with post-training alignment via SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models are open-sourced.
Problem

Research questions and friction points this paper is trying to address.

How to scale diffusion language models to 100B parameters efficiently
How to convert pretrained autoregressive models into discrete diffusion models without costly training from scratch
How to make instruction-tuned MoE diffusion models practical to deploy and open-source
Innovation

Methods, ideas, or system contributions that make the work stand out.

Converts pretrained AR models into dLLMs via block-level training
Uses a 3-phase (warm-up/stable/decay) block-diffusion schedule for efficiency
Aligns instruction-tuned MoE variants with SFT and DPO
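The 3-phase block-size progression described above (warm-up: growing block size; stable: full-sequence diffusion; decay: revert to a compact block) can be sketched as a simple per-step schedule. The concrete block sizes, sequence length, and phase fractions below are made-up placeholders for illustration, not values from the paper.

```python
def block_size_schedule(step, total_steps, seq_len=4096,
                        warmup_frac=0.1, decay_frac=0.1,
                        start_block=32, final_block=32):
    """Illustrative warm-up/stable/decay block-size schedule.

    Warm-up: block size grows from a small block toward the full sequence.
    Stable: full-sequence diffusion (block size == sequence length).
    Decay: revert to a compact block size for efficient block diffusion.
    All concrete values are hypothetical placeholders.
    """
    warmup_end = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_end:
        # geometric growth from start_block up to seq_len over the warm-up
        frac = step / max(warmup_end, 1)
        return int(start_block * (seq_len / start_block) ** frac)
    if step < decay_start:
        return seq_len          # stable phase: full-sequence diffusion
    return final_block          # decay phase: compact block diffusion
```

Ending on a compact block size matters for deployment: small blocks keep decoding memory and latency low while the long stable phase lets the model learn full-sequence dependencies.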