🤖 AI Summary
To address the alignment difficulties and pipeline complexity of conventional post-hoc training of multimodal large language models (MLLMs), this paper proposes a native single-stage joint pretraining paradigm, yielding the InternVL3 series. Methodologically, it introduces: (1) a hybrid pretraining framework over both image-text and pure-text data that unifies visual and linguistic representation learning; (2) variable visual position encoding (V2PE) to enable long visual-context modeling; and (3) a post-training recipe combining supervised fine-tuning and mixed preference optimization (MPO), together with test-time scaling for more robust inference. On the MMMU benchmark, InternVL3-78B scores 72.2, a new state of the art among open-source MLLMs. Its multimodal understanding rivals that of GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while it retains strong pure-language capabilities.
📝 Abstract
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multimodal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state of the art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
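The abstract names V2PE as the mechanism that lets InternVL3 handle extended multimodal contexts. A minimal sketch of the underlying idea, assuming the core intuition only (visual tokens advance the position index by a fractional step smaller than 1, so a long run of image tokens consumes less of the model's position range than the same number of text tokens); the function name and the step value `delta` are illustrative, not the paper's actual implementation:

```python
def v2pe_position_ids(token_types, delta=0.25):
    """Assign a position to each token in a mixed text/image sequence.

    token_types: list of "text" or "image" markers, one per token.
    Text tokens advance the running position by 1; image tokens advance
    it by `delta` < 1, compressing long visual spans into a short
    positional range (the V2PE intuition from the abstract).
    """
    positions = []
    pos = 0.0
    for t in token_types:
        positions.append(pos)
        pos += 1.0 if t == "text" else delta

    return positions

# Two text tokens, eight image tokens, one trailing text token:
# the eight image tokens together span only 2 positions (8 * 0.25),
# not 8, leaving room for much longer visual contexts.
tokens = ["text", "text"] + ["image"] * 8 + ["text"]
print(v2pe_position_ids(tokens))
```

With a standard encoding every token would advance the position by 1, so the final token here would sit at position 10; with `delta=0.25` it sits at 4.0, which is why smaller increments for visual tokens extend the effective context window.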