Growing Visual Generative Capacity for Pre-Trained MLLMs

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing unified multimodal large language models (MLLMs) struggle to achieve semantic alignment and pixel-level fidelity at the same time, and often fall back on non-autoregressive paradigms. This paper introduces Bridge, a purely autoregressive unified MLLM built on a Mixture-of-Transformers architecture that endows pre-trained vision understanding models with native generative capability. Its core innovation is a discrete semantic-to-pixel joint representation, which integrates compact semantic tokens with fine-grained pixel tokens, yielding substantial improvements in language alignment and generation detail while increasing sequence length by only 7.9%. By unifying multimodal understanding and generation under a single autoregressive framework and jointly learning semantic-pixel representations, Bridge achieves competitive or state-of-the-art performance across diverse multimodal understanding and generation benchmarks, while requiring significantly less training data and training time than prior unified MLLMs.
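The Mixture-of-Transformers idea in the summary can be sketched as a layer in which text and image tokens share one joint self-attention pass but are routed to modality-specific feed-forward weights. This is a minimal illustrative sketch, not the paper's implementation: dimensions, initialization, and the two-way text/image split are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size (illustrative)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoTLayer:
    """One Mixture-of-Transformers layer (hypothetical sketch):
    shared causal attention, per-modality feed-forward experts."""

    def __init__(self, d):
        self.Wq = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wk = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wv = rng.standard_normal((d, d)) / np.sqrt(d)
        # one feed-forward matrix per modality (0 = text, 1 = image)
        self.ffn = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(2)]

    def __call__(self, x, modality):
        # shared causal self-attention over the mixed token sequence
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        scores = q @ k.T / np.sqrt(x.shape[1])
        scores[np.triu(np.ones(scores.shape, dtype=bool), k=1)] = -np.inf
        h = x + softmax(scores) @ v
        # modality-specific feed-forward: route each token to its expert
        out = np.empty_like(h)
        for m, W in enumerate(self.ffn):
            idx = modality == m
            out[idx] = h[idx] + np.maximum(h[idx] @ W, 0.0)
        return out

tokens = rng.standard_normal((6, d))      # e.g. 4 text + 2 image tokens
modality = np.array([0, 0, 0, 0, 1, 1])
y = MoTLayer(d)(tokens, modality)
print(y.shape)
```

The design point this illustrates: attention lets the two modalities condition on each other, while separate feed-forward parameters let each modality keep specialized processing inside one next-token model.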

📝 Abstract
Multimodal large language models (MLLMs) extend the success of language models to visual understanding, and recent efforts have sought to build unified MLLMs that support both understanding and generation. However, constructing such models remains challenging: hybrid approaches combine continuous embeddings with diffusion or flow-based objectives, producing high-quality images but breaking the autoregressive paradigm, while pure autoregressive approaches unify text and image prediction over discrete visual tokens but often face trade-offs between semantic alignment and pixel-level fidelity. In this work, we present Bridge, a pure autoregressive unified MLLM that augments pre-trained visual understanding models with generative ability through a Mixture-of-Transformers architecture, enabling both image understanding and generation within a single next-token prediction framework. To further improve visual generation fidelity, we propose a semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens, achieving strong language alignment and precise description of visual details with only a 7.9% increase in sequence length. Extensive experiments across diverse multimodal benchmarks demonstrate that Bridge achieves competitive or superior results in both understanding and generation benchmarks, while requiring less training data and reduced training time compared to prior unified MLLMs.
Problem

Research questions and friction points this paper is trying to address.

Enabling autoregressive unified models for visual understanding and generation
Improving semantic alignment and pixel fidelity in visual token prediction
Reducing training data and time requirements for multimodal models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive unified MLLM with Mixture-of-Transformers architecture
Semantic-to-pixel discrete representation for visual fidelity
Single next-token prediction framework for multimodal tasks
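The semantic-to-pixel representation above can be sketched as a single discrete sequence: a short run of compact, language-aligned semantic tokens precedes the fine-grained pixel tokens, and the whole sequence is trained with ordinary next-token prediction. All vocabulary sizes and token counts below are hypothetical (the paper reports only a 7.9% sequence-length increase for its actual configuration).

```python
import numpy as np

# Hypothetical vocabulary sizes and per-image token counts
TEXT_VOCAB, SEM_VOCAB, PIX_VOCAB = 1000, 256, 4096
N_SEM, N_PIX = 32, 1024

rng = np.random.default_rng(0)
text = rng.integers(0, TEXT_VOCAB, size=16)        # prompt tokens
semantic = rng.integers(0, SEM_VOCAB, size=N_SEM)  # coarse, language-aligned
pixel = rng.integers(0, PIX_VOCAB, size=N_PIX)     # fine-grained detail

# Shared vocabulary: offset each token type into a disjoint id range,
# then lay out the sequence as [text][semantic][pixel]
seq = np.concatenate([
    text,
    semantic + TEXT_VOCAB,
    pixel + TEXT_VOCAB + SEM_VOCAB,
])

# Next-token prediction pairs: predict seq[t+1] from the prefix seq[:t+1]
inputs, targets = seq[:-1], seq[1:]
print(len(seq), len(inputs))
```

Ordering semantics before pixels means the model commits to the global, prompt-aligned content of the image first and only then fills in detail, while the extra semantic tokens add little to the total sequence length.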