🤖 AI Summary
This work addresses the inference inefficiency of autoregressive decoding in large language models (LLMs). We propose a novel paradigm that systematically transforms pretrained autoregressive LLMs into discrete diffusion LLMs (dLLMs) at the 100B-parameter scale. Methodologically, we introduce a three-stage block-level Warmup–Stable–Decay (WSD) training schedule—comprising warm-up, stable, and decay phases—integrated with a sparse Mixture-of-Experts (MoE) architecture and discrete token-level diffusion modeling, alongside supervised fine-tuning (SFT) and Direct Preference Optimization (DPO). To our knowledge, this is the first demonstration of a MoE-based diffusion LLM’s feasibility and practicality at the 100B scale. We open-source two instruction-tuned models: LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), which retain the parallel-decoding advantages of diffusion models while achieving substantial gains in inference speed and task performance. Our approach establishes a viable pathway for deploying ultra-large-scale dLLMs.
📝 Abstract
This paper presents LLaDA2.0 -- a pair of discrete diffusion large language models (dLLMs) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models -- establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds the principles of knowledge inheritance, progressive adaptation, and efficiency-aware design, seamlessly converting a pre-trained AR model into a dLLM with a novel three-phase block-level WSD training scheme: progressively increasing the block size in block diffusion (warm-up), large-scale full-sequence diffusion (stable), and reverting to a compact block size for block diffusion (decay). Together with post-training alignment via supervised fine-tuning (SFT) and Direct Preference Optimization (DPO), we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models are open-sourced.
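The three-phase schedule above can be pictured as a function mapping the training step to the diffusion block size: grow the block during warm-up, span the full sequence during the stable phase, then shrink back to a compact block for decay. The sketch below is a hypothetical illustration of this idea; all step counts and block sizes are assumed values, not figures from the paper.

```python
def block_size_schedule(step: int,
                        warmup_steps: int = 10_000,
                        stable_steps: int = 80_000,
                        min_block: int = 32,
                        seq_len: int = 4096,
                        final_block: int = 32) -> int:
    """Return the diffusion block size at a given training step.

    Illustrative sketch of a warm-up/stable/decay block-size schedule;
    the concrete values and interpolation rule are assumptions.
    """
    if step < warmup_steps:
        # Warm-up: progressively grow the block size from min_block
        # toward the full sequence length.
        frac = step / warmup_steps
        return min(seq_len, max(min_block,
                                int(min_block + frac * (seq_len - min_block))))
    elif step < warmup_steps + stable_steps:
        # Stable: full-sequence diffusion (one block spans the sequence).
        return seq_len
    else:
        # Decay: revert to a compact block size, matching the efficient
        # block-diffusion decoding used at inference time.
        return final_block
```

A linear ramp is used here for simplicity; the paper only states that the block size increases progressively during warm-up, so any monotone growth rule would fit the description equally well.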