Nexus-Gen: A Unified Model for Image Understanding, Generation, and Editing

📅 2025-04-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing open-source unified multimodal large language models (MLLMs) exhibit significantly weaker joint understanding and generation capabilities than specialized architectures. Method: The authors propose Nexus-Gen, an end-to-end unified framework that integrates the reasoning capacity of autoregressive language models with the image synthesis capability of diffusion models. The approach features: (1) a dual-phase embedding-space alignment training strategy; and (2) a prefilled autoregression strategy that uses position-embedded special tokens to mitigate cumulative errors in the continuous latent space. Contribution/Results: Nexus-Gen supports integrated image understanding, generation, and editing, substantially narrowing the performance gap with task-specific architectures across diverse multimodal benchmarks. All code, models, and datasets are publicly released.

📝 Abstract
Unified multimodal large language models (MLLMs) aim to integrate multimodal understanding and generation abilities through a single framework. Despite their versatility, existing open-source unified models exhibit performance gaps against domain-specific architectures. To bridge this gap, we present Nexus-Gen, a unified model that synergizes the language reasoning capabilities of LLMs with the image synthesis power of diffusion models. To align the embedding spaces of the LLM and the diffusion model, we conduct a dual-phase alignment training process: (1) the autoregressive LLM learns to predict image embeddings conditioned on multimodal inputs, while (2) the vision decoder is trained to reconstruct high-fidelity images from these embeddings. While training the LLM, we identified a critical discrepancy between the autoregressive paradigm's training and inference phases, where error accumulation in the continuous embedding space severely degrades generation quality. To avoid this issue, we introduce a prefilled autoregression strategy that prefills the input sequence with position-embedded special tokens instead of continuous embeddings. Through dual-phase training, Nexus-Gen develops the integrated capability to comprehensively address image understanding, generation, and editing tasks. All models, datasets, and code are published at https://github.com/modelscope/Nexus-Gen.git to facilitate further advancements across the field.
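The prefilled autoregression strategy from the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name, dimensions, and method names are all hypothetical. The key idea it shows is that, instead of feeding predicted continuous image embeddings back into the sequence one step at a time (where inference-time prediction errors compound), the image positions are prefilled with a learnable special token plus per-position embeddings, so the model's image-embedding predictions are conditioned only on exact inputs.

```python
import torch
import torch.nn as nn


class PrefilledAutoregression(nn.Module):
    """Sketch of prefilled autoregression (hypothetical names/shapes).

    Rather than appending predicted continuous embeddings back into the
    input sequence during generation, all image positions are prefilled
    with a shared learnable special token carrying only positional
    information, avoiding error accumulation in the continuous space.
    """

    def __init__(self, hidden_dim: int = 512, num_image_tokens: int = 64):
        super().__init__()
        # One learnable special token, shared across every image position.
        self.special_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        # Positional embeddings distinguish the prefilled positions.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_image_tokens, hidden_dim))
        self.num_image_tokens = num_image_tokens

    def build_prefill(self, context: torch.Tensor) -> torch.Tensor:
        """context: (batch, ctx_len, hidden_dim) multimodal hidden states.

        Returns the sequence with position-embedded special tokens
        appended, ready for a single forward pass through the LLM
        backbone that predicts all image embeddings at once.
        """
        batch = context.size(0)
        prefill = (
            self.special_token.expand(batch, self.num_image_tokens, -1)
            + self.pos_embed
        )
        return torch.cat([context, prefill], dim=1)
```

Under this sketch, the backbone sees the same prefilled tokens at training and inference time, which is the training/inference consistency the abstract attributes to the strategy.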
Problem

Research questions and friction points this paper is trying to address.

Bridging performance gaps in unified multimodal models
Aligning LLM and diffusion model embedding spaces
Mitigating error accumulation in autoregressive image generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified model combining LLMs and diffusion models
Dual-phase alignment training for embedding space
Prefilled autoregression strategy to reduce errors
Authors
Hong Zhang
College of Control Science and Engineering, Zhejiang University
Zhongjie Duan
East China Normal University
Xingjun Wang
ModelScope Team, Alibaba Group Inc.
Yingda Chen
Alibaba Group, Microsoft
Yuze Zhao
ModelScope Team, Alibaba Group Inc.
Yu Zhang
College of Control Science and Engineering, Zhejiang University