🤖 AI Summary
This work addresses the lack of efficient context-length extension methods for diffusion-based large language models (e.g., LLaDA) in long-context scenarios. We propose a post-training context extension framework that avoids full retraining. Our method introduces two key innovations: (1) a modified rotary position embedding tailored to the probabilistic modeling characteristics of the diffusion process; and (2) a long-range memory-optimized masking strategy that enhances training stability and information retention. Applying this framework, we extend LLaDA's context window to 128K tokens, yielding UltraLLaDA. Empirical evaluation across multiple long-text understanding and generation benchmarks demonstrates that UltraLLaDA significantly outperforms zero-shot baselines, validating the feasibility and effectiveness of diffusion architectures for ultra-long-context modeling.
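The summary does not spell out the exact RoPE modification used for the extension. For orientation only, a common starting point for post-training context extension is to rescale the rotary base frequency (so-called NTK-aware scaling) before continued training; the sketch below illustrates that generic mechanism in NumPy, with the `scale` factor, function names, and shapes being assumptions for illustration, not the paper's method.

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0, scale=1.0):
    """Per-pair rotary frequencies for RoPE.

    NTK-aware context extension (an assumed, generic variant) raises the
    base by scale^(d / (d - 2)) so low-frequency dimensions stretch to
    cover the longer position range while high frequencies stay intact.
    """
    adjusted_base = base * scale ** (head_dim / (head_dim - 2))
    return adjusted_base ** (-np.arange(0, head_dim, 2) / head_dim)

def rope_rotate(x, positions, freqs):
    """Apply the rotary embedding to x of shape (seq_len, head_dim).

    Each consecutive pair of channels is rotated by position * frequency;
    the rotation is norm-preserving, which is what makes RoPE encode
    relative position in query-key dot products.
    """
    angles = np.outer(positions, freqs)          # (seq_len, head_dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

With `scale=1.0` this reduces to standard RoPE; a larger `scale` (e.g., the ratio of the extended to the original context length) is then typically paired with light post-training, in the spirit of the framework described above.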
📄 Abstract
Diffusion LLMs have attracted growing interest, with much recent work highlighting their potential in various downstream tasks; yet the long-context behavior of diffusion LLMs remains largely uncharted. We present a case study of post-training techniques for extending the context window of diffusion LLMs (specifically, LLaDA) without retraining from scratch. We show that a simple modification to the standard Rotary Positional Embeddings (RoPE) extension effectively accommodates the probabilistic modeling inherent in the diffusion process, enabling stable scaling to longer context ranges. We further compare masking strategies used during post-training and analyze their impact on optimization stability and long-range recall. Instantiating these insights, we introduce UltraLLaDA, a diffusion LLM with a 128K-token context window that, in our empirical evaluation on long-context tasks, significantly outperforms training-free baselines. Our experimental results highlight this specialized positional extension as a key lever for scaling diffusion LLMs to extended contexts and offer practical guidance for practitioners seeking 128K-scale context via efficient post-training.