From Visuals to Vocabulary: Establishing Equivalence Between Image and Text Token Through Autoregressive Pre-training in MLLMs

📅 2025-02-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing multimodal large language models (MLLMs) suffer from coarse-grained cross-modal alignment, in particular failing to faithfully reconstruct fine-grained visual details from images, which limits both comprehension and generation. To address this, we propose Vision Dynamic Embedding-Guided Pretraining (VDEP), a framework that reformulates multimodal alignment as a bidirectional information recovery task: (i) reconstructing image features from text, and (ii) autoregressively generating token sequences directly from image latent states. VDEP requires no architectural modification; it supervises the model's image hidden states with dynamic embeddings produced by the MLP following the visual encoder, so that image tokens participate actively in autoregressive language modeling. Evaluated on 13 benchmarks, VDEP consistently outperforms existing methods, with clear gains in fine-grained visual reasoning, descriptive captioning, and cross-modal alignment.
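The summary above describes a pipeline in which a visual encoder feeds an MLP projector, the resulting dynamic embeddings are interleaved with text tokens in a single autoregressive sequence, and the LLM's hidden states at image positions serve as the image latent states to be supervised. The PyTorch-style sketch below illustrates that flow; the module names (`vision_encoder`, `mlp_projector`, `language_model`), the HuggingFace-like `inputs_embeds`/`output_hidden_states` interface, and the assumed tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class VDEPStyleForward(nn.Module):
    """Illustrative forward pass: image patches -> dynamic embeddings -> LLM."""

    def __init__(self, vision_encoder, mlp_projector, language_model):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a ViT returning (B, T_img, D_v)
        self.mlp_projector = mlp_projector    # maps D_v -> D_llm (dynamic embeddings)
        self.language_model = language_model  # decoder-only LLM accepting inputs_embeds

    def forward(self, pixel_values, text_embeds):
        # Dynamic visual embeddings from the MLP following the visual encoder.
        visual_feats = self.vision_encoder(pixel_values)       # (B, T_img, D_v)
        dynamic_embeds = self.mlp_projector(visual_feats)      # (B, T_img, D_llm)

        # Interleave image and text embeddings into one autoregressive sequence.
        inputs_embeds = torch.cat([dynamic_embeds, text_embeds], dim=1)
        outputs = self.language_model(inputs_embeds=inputs_embeds,
                                      output_hidden_states=True)

        # Hidden states at the image positions act as "image latent states"
        # to be supervised against the dynamic embeddings (see loss sketch below).
        t_img = dynamic_embeds.size(1)
        image_hidden = outputs.hidden_states[-1][:, :t_img, :]
        return outputs.logits, image_hidden, dynamic_embeds
```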

📝 Abstract
While MLLMs perform well on perceptual tasks, they lack precise multimodal alignment, which limits performance. To address this challenge, we propose Vision Dynamic Embedding-Guided Pretraining (VDEP), a hybrid autoregressive training paradigm for MLLMs. Using dynamic embeddings from the MLP that follows the visual encoder, the approach supervises image hidden states and integrates image tokens into autoregressive training. Existing MLLMs have primarily focused on recovering information from textual inputs, often neglecting the effective processing of image data. In contrast, the key contribution of this work is to reinterpret multimodal alignment as a process of recovering information from the input data, with particular emphasis on reconstructing detailed visual features. The proposed method integrates seamlessly into standard models without architectural changes. Experiments on 13 benchmarks show that VDEP consistently outperforms baselines, surpassing existing methods.
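Pairing the abstract's dynamic-embedding supervision with a standard next-token loss, a hybrid objective might be sketched as below. This is a minimal, assumed formulation: the function name, the MSE form of the image term, the `detach()` on the targets, and the `image_weight` balance are not confirmed by the abstract.

```python
import torch
import torch.nn.functional as F

def vdep_style_loss(text_logits, text_labels, image_hidden_states,
                    visual_embeddings, image_weight=1.0):
    """Hybrid autoregressive objective sketched from the abstract.

    text_logits:         (B, T_text, V) next-token predictions at text positions
    text_labels:         (B, T_text)    shifted text token ids (-100 = ignore)
    image_hidden_states: (B, T_img, D)  LLM hidden states at image-token positions
    visual_embeddings:   (B, T_img, D)  dynamic embeddings from the MLP projector
                                        (assumed supervision target)
    """
    # Standard language-modeling loss over text tokens.
    lm_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_labels.reshape(-1),
        ignore_index=-100,
    )

    # Regression-style supervision of image hidden states against the
    # projector's dynamic visual embeddings, so image tokens also carry
    # an autoregressive training signal (assumed MSE form).
    image_loss = F.mse_loss(image_hidden_states, visual_embeddings.detach())

    return lm_loss + image_weight * image_loss
```

Detaching the projector outputs treats them purely as targets here; whether VDEP backpropagates through the projector for this term is not stated in the abstract.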
Problem

Research questions and friction points this paper is trying to address.

Improving multimodal alignment in MLLMs
Enhancing image data processing efficiency
Reconstructing detailed visual features effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Dynamic Embedding-Guided Pretraining
Dynamic embeddings from the MLP following the visual encoder
Reinterprets multimodal alignment process
🔎 Similar Papers
No similar papers found.
Mingxiao Li
Tencent Hunyuan
Fang Qu
University of Science and Technology of China
Zhanpeng Chen
Peking University
Vision-language Model
Na Su
Tencent WXG Group
Zhizhou Zhong
PhD student @ HKUST
face recognition, biometrics, AIGC
Ziyang Chen
Peking University
Quantum key distribution, Quantum random number generation
Nan Du
Tencent Hunyuan
Xiaolong Li
Tencent Hunyuan