InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

πŸ“… 2025-04-14
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the alignment difficulties and pipeline complexity of conventional post-hoc training of multimodal large language models (MLLMs), this paper proposes a native, single-stage multimodal pre-training paradigm, yielding the InternVL3 series. Methodologically, it introduces: (1) a unified pre-training framework that jointly learns visual and linguistic representations from diverse multimodal data and pure-text corpora; (2) variable visual position encoding (V2PE) to support long multimodal contexts; and (3) post-training with supervised fine-tuning (SFT) and mixed preference optimization (MPO), combined with test-time scaling for more robust inference. On the MMMU benchmark, InternVL3-78B scores 72.2, a new state of the art among open-source MLLMs. Its multimodal understanding rivals that of ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while it maintains strong pure-language capabilities.
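
The summary names V2PE without detail. Published descriptions of variable visual position encoding have visual tokens advance the position index by a fractional stride rather than 1, so long image sequences occupy far fewer positional slots; the sketch below assumes that formulation. The function name and the stride value `delta` are illustrative choices, not the paper's settings.

```python
import torch

def v2pe_position_ids(is_visual: torch.Tensor, delta: float = 0.25) -> torch.Tensor:
    # Text tokens advance the position counter by 1; visual tokens advance
    # it by a smaller fractional stride delta (< 1), compressing the
    # positional footprint of long image sequences.
    steps = torch.ones_like(is_visual, dtype=torch.float32)
    steps[is_visual] = delta  # delta is a hypothetical stride choice
    # The position of token t is the sum of the strides of all earlier tokens.
    return torch.cumsum(steps, dim=-1) - steps

# One text token, four visual tokens, one text token:
mask = torch.tensor([False, True, True, True, True, False])
print(v2pe_position_ids(mask, delta=0.25))
# tensor([0.0000, 1.0000, 1.2500, 1.5000, 1.7500, 2.0000])
```

How these fractional positions feed into the model's rotary embeddings is beyond this sketch.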

πŸ“ Abstract
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
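
The abstract mentions test-time scaling strategies without specifying them. A common instantiation, shown here as a hedged sketch rather than the paper's exact recipe, is Best-of-N sampling: draw several candidate responses and let a critic model pick the winner. `generate` and `score` are hypothetical stand-ins for the policy MLLM and the critic.

```python
from typing import Callable, List

def best_of_n(generate: Callable[[str], str],
              score: Callable[[str, str], float],
              prompt: str,
              n: int = 8) -> str:
    # Spend more compute at inference time: sample n candidates from the
    # policy model, then keep the one the critic scores highest.
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: score(prompt, response))
```

Increasing n trades inference cost for robustness; the quality of the critic bounds the benefit.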
Problem

Research questions and friction points this paper is trying to address.

Conventional post-hoc adaptation of text-only LLMs to visual inputs creates alignment challenges and pipeline complexity
How to acquire multimodal and linguistic capabilities jointly in a single pre-training stage (a loss sketch follows this list)
How to extend multimodal context length and keep inference robust at scale
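
As a rough illustration of the single-stage objective, the sketch below assumes the standard practice of plain next-token prediction over a mixture of multimodal and pure-text samples, with the loss taken only on text positions; shapes and the function name are illustrative, and details such as the paper's loss re-weighting across samples are omitted.

```python
import torch
import torch.nn.functional as F

def joint_pretraining_loss(logits: torch.Tensor,    # [batch, seq, vocab]
                           targets: torch.Tensor,   # [batch, seq] next-token ids
                           is_text: torch.Tensor) -> torch.Tensor:  # [batch, seq] bool
    # One stage, one objective: autoregressive cross-entropy over mixed
    # multimodal and pure-text batches. Visual tokens condition the model
    # but are masked out of the loss; only text positions are targets.
    per_token = F.cross_entropy(logits.flatten(0, 1), targets.flatten(),
                                reduction="none")
    mask = is_text.flatten().float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```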
Innovation

Methods, ideas, or system contributions that make the work stand out.

Native multimodal pre-training paradigm
Variable visual position encoding (V2PE)
Advanced post-training techniques (SFT, MPO; see the MPO sketch after this list)
Test-time scaling strategies for more robust inference
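
The companion MPO work formulates mixed preference optimization as a weighted sum of a preference loss (e.g. DPO over chosen-vs-rejected pairs), an absolute quality loss, and a standard generation (SFT) loss; the sketch below assumes that formulation, and the weights are placeholders rather than the paper's settings.

```python
import torch

def mpo_loss(loss_pref: torch.Tensor,
             loss_quality: torch.Tensor,
             loss_gen: torch.Tensor,
             w_pref: float = 1.0,
             w_quality: float = 0.5,
             w_gen: float = 0.5) -> torch.Tensor:
    # Mixed preference optimization blends three training signals:
    #   loss_pref    - relative preference between response pairs (e.g. DPO)
    #   loss_quality - absolute quality of individual responses
    #   loss_gen     - next-token generation loss on preferred responses
    # The w_* weights are hypothetical placeholders.
    return w_pref * loss_pref + w_quality * loss_quality + w_gen * loss_gen
```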
πŸ‘₯ Authors
Jinguo Zhu
Shanghai AI Laboratory
Weiyun Wang
Shanghai AI Laboratory; Fudan University
Vision-Language Model · MLLM · Foundation Model
Zhe Chen
Nanjing University, Shanghai AI Laboratory
Zhaoyang Liu
Tongyi Lab, Alibaba Group
LLM · Recommendation
Shenglong Ye
Shanghai AI Laboratory
Lixin Gu
Shanghai AI Laboratory
Yuchen Duan
The Chinese University of Hong Kong, Shanghai AI Laboratory
Hao Tian
Weijie Su
Associate Professor, University of Pennsylvania
Machine Learning · Differential Privacy · High-Dimensional Statistics · Optimization · Deep Learning
Jie Shao
Professor, University of Electronic Science and Technology of China
Multimedia · Database
Zhangwei Gao
Shanghai Jiao Tong University, Shanghai AI Laboratory
Erfei Cui
Shanghai AI Laboratory; Shanghai Jiao Tong University
Computer Vision
Yue Cao
Nanjing University, Shanghai AI Laboratory
Yangzhou Liu
Nanjing University, Shanghai AI Laboratory
Weiye Xu
Shanghai AI Laboratory
Hao Li
Shanghai AI Laboratory
Jiahao Wang
Shanghai AI Laboratory
Dengnian Chen
Shanghai AI Laboratory
Songze Li
Shanghai AI Laboratory
Yinan He
Shanghai AI Laboratory
Tan Jiang
Jiapeng Luo
Yi Wang
Shanghai AI Laboratory
Conghui He
Shanghai AI Laboratory
Data-centric AI · LLM · Document Intelligence
Botian Shi
Shanghai Artificial Intelligence Laboratory
VLMs · Document Understanding · Autonomous Driving
Xingcheng Zhang
Shanghai AI Laboratory
Wenqi Shao
Researcher at Shanghai AI Laboratory
Foundation Model Evaluation · LLM Compression · Efficient Adaptation · Multimodal Learning
Junjun He
Shanghai Jiao Tong University
Yingtong Xiong
Shanghai AI Laboratory
Wenwen Qu
Shanghai AI Laboratory
Peng Sun
Shanghai AI Laboratory
Penglong Jiao
Shanghai AI Laboratory
Han Lv
Shanghai AI Laboratory
Lijun Wu
Shanghai AI Laboratory
ML · LLM · AI4Science
Kaipeng Zhang
Shanghai AI Laboratory
LLM · Multimodal LLMs · AIGC
Huipeng Deng
Shanghai AI Laboratory
Jiaye Ge
Shanghai AI Laboratory
Kaiming Chen
Shanghai AI Laboratory
Limin Wang
Nanjing University, Shanghai AI Laboratory
Mingsong Dou
Shanghai AI Laboratory
Lewei Lu
Research Director (We're Hiring, luotto@sensetime.com) @ SenseTime Research
Computer Vision · Deep Learning
Xizhou Zhu
Tsinghua University
Tong Lu
Nanjing University
Dahua Lin
The Chinese University of Hong Kong
computer vision · machine learning · probabilistic inference · bayesian nonparametrics
Yu Qiao
Shanghai AI Laboratory
Jifeng Dai
Associate Professor of EE, Tsinghua University
computer vision · deep learning
Wenhai Wang
The Chinese University of Hong Kong, Shanghai AI Laboratory