🤖 AI Summary
To address scalability, data-quality, and training-efficiency bottlenecks in vision-language models (VLMs), this paper introduces SAIL-VL, a 2-billion-parameter open-source VLM trained with a scalable, high-quality data curation paradigm. The authors propose a quantity-and-quality dual-scaling training framework: large-scale pretraining for quantity scaling, and a multi-stage curriculum-learning recipe for quality scaling that improves performance gains with data size from logarithmic toward near-linear. They construct SAIL-Caption, a billion-scale image-text dataset with the highest quality among open-source caption datasets, via automated recaptioning plus rigorous quality assessment and filtering, and train SAIL-VL with hundred-billion-token pretraining followed by curriculum-driven supervised fine-tuning. SAIL-VL achieves the best average score across 19 mainstream multimodal benchmarks and ranks first among VLMs of comparable size (2B parameters) on the OpenCompass leaderboard. The model and training recipes are publicly released on Hugging Face.
📝 Abstract
In this paper, we introduce SAIL-VL (ScAlable Vision Language Model TraIning via High QuaLity Data Curation), an open-source vision-language model (VLM) with state-of-the-art (SOTA) performance at 2B parameters. We introduce three key improvements that contribute to SAIL-VL's leading performance: (1) Scalable high-quality visual understanding data construction: We implement a visual understanding data construction pipeline that enables hundred-million-scale high-quality recaption data annotation. With this pipeline, we curate SAIL-Caption, a caption dataset with the largest quantity and the highest data quality among open-source caption datasets. (2) Scalable pretraining with high-quality visual understanding data: We scale SAIL-VL's pretraining budget up to 131B tokens and show that even a 2B VLM benefits from scaled-up training data, exhibiting the expected data-size scaling laws in visual understanding and instruction-following performance. (3) Scalable SFT via quantity and quality scaling: We introduce general guidance for instruction data curation that allows us to scale up instruction data continuously and construct a large SFT dataset of the highest quality. To further improve SAIL-VL's performance, we propose quality scaling, a multi-stage training recipe with curriculum learning, which improves the model's performance-scaling curve w.r.t. data size from logarithmic to near-linear. SAIL-VL obtains the highest average score across 19 commonly used benchmarks in our evaluation and achieves top-1 performance among VLMs of comparable size on OpenCompass (https://rank.opencompass.org.cn/leaderboard-multimodal). We release our SAIL-VL-2B model on Hugging Face (https://huggingface.co/BytedanceDouyinContent/SAIL-VL-2B).
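The claimed effect of quality scaling, shifting the performance curve w.r.t. data size from logarithmic to near-linear, can be sketched with synthetic numbers. The data sizes, scores, and functional forms below are purely illustrative assumptions, not measurements from the paper:

```python
import numpy as np

# Hypothetical benchmark scores at increasing SFT data sizes (illustrative
# placeholders, not the paper's actual results).
data_sizes = np.array([1e6, 4e6, 16e6, 64e6])        # number of samples
scores_log = 50 + 5 * np.log10(data_sizes / 1e6)     # logarithmic regime
scores_lin = 50 + 0.2 * (data_sizes / 1e6)           # near-linear regime

# Fit each curve with its matching functional form: a degree-1 fit in
# log10(data size) recovers the logarithmic trend, while a degree-1 fit in
# raw data size recovers the near-linear trend.
log_fit = np.polyfit(np.log10(data_sizes), scores_log, 1)
lin_fit = np.polyfit(data_sizes, scores_lin, 1)

print(f"log-regime gain per decade of data: {log_fit[0]:.2f}")
print(f"linear-regime gain per 1M samples: {lin_fit[0] * 1e6:.2f}")
```

Under a logarithmic law, each constant score gain requires multiplying the data size; under a near-linear law, adding a fixed amount of data yields a fixed gain, which is why the near-linear regime is far more favorable for continued data scaling.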