From Visuals to Vocabulary: Establishing Equivalence Between Image and Text Token Through Autoregressive Pre-training in MLLMs

📅 2025-02-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing multimodal large language models (MLLMs) suffer from coarse-grained cross-modal alignment, in particular failing to faithfully reconstruct fine-grained visual details from images, which limits both comprehension and generation. To address this, we propose Vision Dynamic Embedding-Guided Pretraining (VDEP), a framework that reformulates multimodal alignment as a bidirectional information recovery task: (i) reconstructing image features from text, and (ii) autoregressively generating token sequences directly from image latent states. VDEP requires no architectural modification; it supervises the model's image hidden states with dynamic embeddings produced by the MLP following the visual encoder, so that image tokens participate actively in autoregressive language modeling. Evaluated on 13 benchmarks, VDEP consistently outperforms existing methods, with clear gains in fine-grained visual reasoning, descriptive captioning, and cross-modal alignment.
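The summary above describes a pipeline in which a visual encoder feeds an MLP projector, the resulting dynamic embeddings are interleaved with text tokens in a single autoregressive sequence, and the LLM's hidden states at image positions serve as the image latent states to be supervised. The PyTorch-style sketch below illustrates that flow; the module names (`vision_encoder`, `mlp_projector`, `language_model`), the HuggingFace-like `inputs_embeds`/`output_hidden_states` interface, and the assumed tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class VDEPStyleForward(nn.Module):
    """Illustrative forward pass: image patches -> dynamic embeddings -> LLM."""

    def __init__(self, vision_encoder, mlp_projector, language_model):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a ViT returning (B, T_img, D_v)
        self.mlp_projector = mlp_projector    # maps D_v -> D_llm (dynamic embeddings)
        self.language_model = language_model  # decoder-only LLM accepting inputs_embeds

    def forward(self, pixel_values, text_embeds):
        # Dynamic visual embeddings from the MLP following the visual encoder.
        visual_feats = self.vision_encoder(pixel_values)       # (B, T_img, D_v)
        dynamic_embeds = self.mlp_projector(visual_feats)      # (B, T_img, D_llm)

        # Interleave image and text embeddings into one autoregressive sequence.
        inputs_embeds = torch.cat([dynamic_embeds, text_embeds], dim=1)
        outputs = self.language_model(inputs_embeds=inputs_embeds,
                                      output_hidden_states=True)

        # Hidden states at the image positions act as "image latent states"
        # to be supervised against the dynamic embeddings (see loss sketch below).
        t_img = dynamic_embeds.size(1)
        image_hidden = outputs.hidden_states[-1][:, :t_img, :]
        return outputs.logits, image_hidden, dynamic_embeds
```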

📝 Abstract
While MLLMs perform well on perceptual tasks, they lack precise multimodal alignment, which limits performance. To address this challenge, we propose Vision Dynamic Embedding-Guided Pretraining (VDEP), a hybrid autoregressive training paradigm for MLLMs. Using dynamic embeddings from the MLP that follows the visual encoder, the approach supervises image hidden states and integrates image tokens into autoregressive training. Existing MLLMs have primarily focused on recovering information from textual inputs, often neglecting the effective processing of image data. In contrast, the key contribution of this work is to reinterpret multimodal alignment as a process of recovering information from the input data, with particular emphasis on reconstructing detailed visual features. The proposed method integrates seamlessly into standard models without architectural changes. Experiments on 13 benchmarks show that VDEP consistently outperforms baselines, surpassing existing methods.
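Pairing the abstract's dynamic-embedding supervision with a standard next-token loss, a hybrid objective might be sketched as below. This is a minimal, assumed formulation: the function name, the MSE form of the image term, the `detach()` on the targets, and the `image_weight` balance are not confirmed by the abstract.

```python
import torch
import torch.nn.functional as F

def vdep_style_loss(text_logits, text_labels, image_hidden_states,
                    visual_embeddings, image_weight=1.0):
    """Hybrid autoregressive objective sketched from the abstract.

    text_logits:         (B, T_text, V) next-token predictions at text positions
    text_labels:         (B, T_text)    shifted text token ids (-100 = ignore)
    image_hidden_states: (B, T_img, D)  LLM hidden states at image-token positions
    visual_embeddings:   (B, T_img, D)  dynamic embeddings from the MLP projector
                                        (assumed supervision target)
    """
    # Standard language-modeling loss over text tokens.
    lm_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_labels.reshape(-1),
        ignore_index=-100,
    )

    # Regression-style supervision of image hidden states against the
    # projector's dynamic visual embeddings, so image tokens also carry
    # an autoregressive training signal (assumed MSE form).
    image_loss = F.mse_loss(image_hidden_states, visual_embeddings.detach())

    return lm_loss + image_weight * image_loss
```

Detaching the projector outputs treats them purely as targets here; whether VDEP backpropagates through the projector for this term is not stated in the abstract.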
Problem

Research questions and friction points this paper is trying to address.

Improving multimodal alignment in MLLMs
Enhancing image data processing efficiency
Reconstructing detailed visual features effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Dynamic Embedding-Guided Pretraining
Dynamic embeddings from the MLP following the visual encoder
Reinterprets multimodal alignment process
🔎 Similar Papers
No similar papers found.
Mingxiao Li
Tencent Hunyuan
Fang Qu
University of Science and Technology of China
Zhanpeng Chen
Peking University
Vision-language Model
Na Su
Tencent WXG Group
Zhizhou Zhong
PhD student @ HKUST
face recognition, biometrics, AIGC
Ziyang Chen
Peking University
Quantum key distribution, Quantum random number generation
Nan Du
Tencent Hunyuan
Xiaolong Li
Tencent Hunyuan