Mosaic: Unlocking Long-Context Inference for Diffusion LLMs via Global Memory Planning and Dynamic Peak Taming

📅 2026-01-10
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the severe memory peak and efficiency bottlenecks in diffusion-based large language models (dLLMs) during long-context reasoning, which stem from transient activation recomputation and localized memory management. To mitigate these issues, the authors propose Mosaic, a system that introduces, for the first time, a global dynamic memory scheduling mechanism. Mosaic integrates peak memory prediction, virtual address space management, and a lazy chunking optimizer to jointly reduce memory redundancy and fragmentation. Experimental results demonstrate that Mosaic reduces the peak-to-average memory ratio by 2.71× on average, enables 15.89–32.98× longer maximum sequence lengths under the same hardware constraints, and achieves latency reductions of 4.12%–23.26%, all while preserving generation accuracy.
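The "transient activation" peak described above is dominated by tensors such as the full-vocabulary logits that are rematerialized at every denoising step. As a toy illustration of the chunking idea (hypothetical shapes and function names, not Mosaic's actual kernel), projecting hidden states to the vocabulary in slices keeps only a `(seq, chunk)` buffer alive instead of the full `(seq, vocab)` logits tensor:

```python
import numpy as np

def argmax_logits_chunked(hidden, w_vocab, chunk=1024):
    """Project hidden states to vocabulary logits chunk by chunk,
    keeping only a running argmax so the full (seq, vocab) logits
    tensor is never materialized at once."""
    seq_len = hidden.shape[0]
    best_id = np.zeros(seq_len, dtype=np.int64)
    best_val = np.full(seq_len, -np.inf)
    for start in range(0, w_vocab.shape[1], chunk):
        logits = hidden @ w_vocab[:, start:start + chunk]  # (seq, chunk)
        local_best = logits.argmax(axis=1)
        local_val = logits[np.arange(seq_len), local_best]
        update = local_val > best_val
        best_id[update] = local_best[update] + start
        best_val[update] = local_val[update]
    return best_id

# Transient peak is (seq, chunk) rather than (seq, vocab),
# while the result matches the unchunked computation.
hidden = np.random.randn(8, 16)
w_vocab = np.random.randn(16, 50_000)
assert np.array_equal(argmax_logits_chunked(hidden, w_vocab),
                      (hidden @ w_vocab).argmax(axis=1))
```

The actual system additionally decides *when* and *how finely* to chunk online (the "lazy chunking optimizer"), trading recomputation overhead against peak memory, rather than using a fixed chunk size as in this sketch.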

📝 Abstract
Diffusion-based large language models (dLLMs) have emerged as a promising paradigm, utilizing simultaneous denoising to enable global planning and iterative refinement. While these capabilities are particularly advantageous for long-context generation, deploying such models faces a prohibitive memory capacity barrier stemming from severe system inefficiencies. We identify that existing inference systems are ill-suited for this paradigm: unlike autoregressive models constrained by the cumulative KV-cache, dLLMs are bottlenecked by transient activations recomputed at every step. Furthermore, general-purpose memory reuse mechanisms lack the global visibility to adapt to dLLMs' dynamic memory peaks, which toggle between logits and FFNs. To address these mismatches, we propose Mosaic, a memory-efficient inference system that shifts from local, static management to a global, dynamic paradigm. Mosaic integrates a mask-only logits kernel to eliminate redundancy, a lazy chunking optimizer driven by an online heuristic search to adaptively mitigate dynamic peaks, and a global memory manager to resolve fragmentation via virtual addressing. Extensive evaluations demonstrate that Mosaic achieves an average 2.71$\times$ reduction in the memory peak-to-average ratio and increases the maximum inference sequence length supportable on identical hardware by 15.89-32.98$\times$. This scalability is achieved without compromising accuracy or speed; in fact, Mosaic reduces latency by 4.12%-23.26%.
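The "mask-only logits kernel" mentioned in the abstract exploits the fact that, at each denoising step, only positions still holding mask tokens need vocabulary logits. A minimal sketch of that idea (illustrative shapes and names; the paper's kernel operates on GPU tensors, not numpy arrays):

```python
import numpy as np

def logits_masked_only(hidden, w_vocab, mask):
    """Compute vocabulary logits only for positions that are still
    masked, skipping already-decoded tokens. Returns the masked
    positions and their (num_masked, vocab) logits."""
    idx = np.flatnonzero(mask)            # positions awaiting denoising
    return idx, hidden[idx] @ w_vocab     # (num_masked, vocab)

hidden = np.random.randn(6, 4)
w_vocab = np.random.randn(4, 10)
mask = np.array([True, False, True, True, False, False])
idx, logits = logits_masked_only(hidden, w_vocab, mask)
assert logits.shape == (3, 10)
assert np.allclose(logits, (hidden @ w_vocab)[idx])
```

Because the number of masked positions shrinks as generation proceeds, the logits footprint shrinks with it instead of staying at the full sequence length throughout.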
Problem

Research questions and friction points this paper is trying to address.

diffusion LLMs
long-context inference
memory bottleneck
dynamic memory peaks
transient activations
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion LLMs
global memory planning
dynamic peak taming
memory-efficient inference
virtual addressing
Liang Zheng
Tianjin University, China
Bowen Shi
Tianjin University, China
Yitao Hu
Professor, Tianjin University
LLM System · DNN System · AI for Science
Jiawei Zhang
Tianjin University, China
Ruofan Li
Tianjin University, China
Sheng Chen
Tianjin University, China
Wenxin Li
Tianjin University, China
Keqiu Li
Tianjin University, China