SAIL-VL2 Technical Report

📅 2025-09-17
📈 Citations: 0 (influential: 0)
🤖 AI Summary
To address the limitations of existing vision-language models in fine-grained perception and complex reasoning, this paper introduces SAIL-VL2, an open-source multimodal foundation model supporting both image and video understanding. Methodologically, the authors (1) develop a high-quality data curation pipeline spanning image, text, and video sources; (2) propose a progressive training framework that integrates visual encoder pretraining with chain-of-thought-enhanced supervised fine-tuning and reinforcement learning; and (3) couple the SAIL-ViT visual backbone with a sparse Mixture-of-Experts (MoE) architecture to improve parameter efficiency. Evaluated across 106 benchmarks, SAIL-VL2-2B achieves state-of-the-art performance on key reasoning tasks such as MMMU and MathVista, and ranks first on the OpenCompass leaderboard among officially released open-source models under the 4B parameter scale. These results advance the development of open multimodal foundation models.

📝 Abstract
We introduce SAIL-VL2, an open-suite large vision-language model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Three core innovations drive its effectiveness. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, while serving as an efficient and extensible foundation for the open-source multimodal community.
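The sparse Mixture-of-Experts design highlighted in the abstract replaces a single dense feed-forward block with a pool of small expert networks, of which only a few are activated per token, so model capacity grows without a matching growth in per-token compute. Below is a minimal sketch of a top-k routed MoE layer in PyTorch; the class name, dimensions, expert count, and routing details are illustrative assumptions, not SAIL-VL2's actual implementation.

```python
# Minimal sketch of a sparsely-routed Mixture-of-Experts (MoE) feed-forward layer.
# Hypothetical illustration: names, sizes, and routing are assumptions, not SAIL-VL2 code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # per-token gating scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                   # x: (batch, seq, d_model)
        tokens = x.reshape(-1, x.size(-1))                  # flatten to (N, d_model)
        weights, idx = self.router(tokens).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                # renormalize over chosen experts
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])
        return out.reshape_as(x)

moe = SparseMoE()
y = moe(torch.randn(2, 16, 512))  # each token activates only 2 of the 8 experts
```

With top_k=2 of 8 experts, each token touches only a quarter of the expert parameters per forward pass, which is the parameter-efficiency argument the abstract makes for MoE.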
Problem

Research questions and friction points this paper is trying to address.

Advancing multimodal understanding and reasoning capabilities
Enhancing training efficiency through data curation and filtering
Extending architectural designs to efficient sparse MoE models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale data curation pipeline with scoring and filtering (see the sketch after this list)
Progressive training framework with SFT-RL
Sparse Mixture-of-Experts architectural design
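To make the scoring-and-filtering idea concrete, here is a toy sketch of a quality filter over image-caption pairs. It is a hypothetical illustration only: a real curation pipeline like the one the paper describes would rely on learned scorers (e.g., CLIP-style image-text similarity) rather than this handcrafted heuristic, and the Sample fields, score_fn, and threshold are all assumptions.

```python
# Hypothetical sketch of a scoring-and-filtering pass over image-caption pairs.
# The Sample schema, score_fn heuristic, and threshold are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str
    caption: str

def score_fn(s: Sample) -> float:
    """Toy quality score in [0, 1]: reward descriptive length, penalize boilerplate.
    A real pipeline would use a learned scorer (e.g., CLIP-style similarity)."""
    words = s.caption.split()
    length_score = min(len(words) / 20.0, 1.0)          # saturate at 20 words
    boilerplate = 0.5 if s.caption.lower().startswith("image of") else 1.0
    return length_score * boilerplate

def curate(samples, threshold=0.5):
    """Keep only samples whose quality score clears the threshold."""
    return [s for s in samples if score_fn(s) >= threshold]

raw = [
    Sample("a.jpg", "Image of a dog"),
    Sample("b.jpg", "A golden retriever catching a red frisbee mid-air in a sunny park"),
]
print([s.caption for s in curate(raw)])  # only the detailed caption survives
```

The same filter loop extends to OCR, QA, and video samples by swapping in modality-specific scorers.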
👥 Authors
Weijie Yin (ByteDance) · Vision Language Model, Deep Learning, AI4S
Yongjie Ye (Douyin SAIL Team, LV-NUS Lab)
Fangxun Shu (ByteDance) · Multimodal
Yue Liao (National University of Singapore) · Computer Vision, Deep Learning, MLLM
Zijian Kang (Douyin SAIL Team, LV-NUS Lab)
Hongyuan Dong (Douyin SAIL Team, LV-NUS Lab)
Haiyang Yu (Douyin SAIL Team, LV-NUS Lab)
Dingkang Yang (ByteDance) · Multimodal Learning, Generative AI, Embodied AI
Jiacong Wang (University of Chinese Academy of Sciences; ByteDance) · CV, Multimodal, MLLM
Han Wang (Douyin SAIL Team, LV-NUS Lab)
Wenzhuo Liu (Douyin SAIL Team, LV-NUS Lab)
Xiao Liang (Douyin SAIL Team, LV-NUS Lab)
Shuicheng Yan (Douyin SAIL Team, LV-NUS Lab)
Chao Feng (University of Zurich) · Network, Machine Learning, Cybersecurity