Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

📅 2024-10-24
🏛️ arXiv.org
📈 Citations: 4
Influential: 1
🤖 AI Summary
Existing open-source vision-language models (VLMs) are hindered by the limited scale and quality of publicly available multimodal instruction datasets, preventing them from matching the performance of proprietary models trained on large, high-quality data. To address this, the authors introduce Infinity-MM, a large-scale (40M+ samples), high-quality open multimodal instruction dataset, and propose a synthetic generation method based on a tagging system that maps image types to instruction types, enabling controllable and scalable continual data synthesis. Leveraging a unified preprocessing pipeline and open-source VLMs for generation, they train Aquila-VL-2B, a 2-billion-parameter VLM that achieves state-of-the-art performance among models of similar scale on multimodal benchmarks. The dataset and model weights are fully open-sourced.

📝 Abstract
Recently, Vision-Language Models (VLMs) have achieved remarkable progress in multimodal tasks, and multimodal instruction data serves as the foundation for enhancing VLM capabilities. Despite the availability of several open-source multimodal datasets, limitations in the scale and quality of open-source instruction data hinder the performance of VLMs trained on these datasets, leading to a significant gap compared to models trained on closed-source data. To address this challenge, we introduce Infinity-MM, a large-scale multimodal instruction dataset. We collected the available multimodal instruction datasets and performed unified preprocessing, resulting in a dataset with over 40 million samples that ensures diversity and accuracy. Furthermore, to enable large-scale expansion of instruction data and support the continuous acquisition of high-quality data, we propose a synthetic instruction generation method based on a tagging system and open-source VLMs. By establishing correspondences between different types of images and associated instruction types, this method can provide essential guidance during data synthesis. Leveraging this high-quality data, we have trained a 2-billion-parameter Vision-Language Model, Aquila-VL-2B, which achieves state-of-the-art (SOTA) performance among models of similar scale. The data is available at: https://huggingface.co/datasets/BAAI/Infinity-MM.
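The abstract describes a synthesis method in which a tagging system establishes correspondences between image types and suitable instruction types, and open-source VLMs then generate the instruction data. A minimal sketch of that loop is below; all names (the tag mapping, `vlm_generate`, the instruction types) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the tag-guided synthesis loop described in the
# abstract: each image's tag selects candidate instruction types, and an
# open-source VLM (stubbed here as `vlm_generate`) produces the
# instruction and its response. The mapping below is invented for
# illustration; the paper's tagging system is more elaborate.

# Assumed mapping from image tag to candidate instruction types.
TAG_TO_INSTRUCTION_TYPES = {
    "chart": ["data extraction", "trend analysis"],
    "natural_scene": ["captioning", "object counting"],
    "document": ["OCR", "question answering"],
}

def synthesize_instructions(images, vlm_generate):
    """Generate (image, instruction, response) samples guided by image tags.

    `images` is an iterable of (image_id, tag) pairs; `vlm_generate` is a
    stand-in for an open-source VLM's text-generation call.
    """
    samples = []
    for image_id, tag in images:
        # Fall back to generic captioning for unmapped tags.
        for instr_type in TAG_TO_INSTRUCTION_TYPES.get(tag, ["captioning"]):
            prompt = f"Write a '{instr_type}' instruction for image {image_id}."
            instruction = vlm_generate(prompt)
            response = vlm_generate(f"{instruction} (image: {image_id})")
            samples.append(
                {"image": image_id, "instruction": instruction, "response": response}
            )
    return samples
```

The key design point the abstract emphasizes is that the tag-to-instruction mapping makes synthesis controllable: the distribution of generated instruction types can be steered per image category rather than left to the generator.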
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Multi-modal Instruction Datasets
Performance Limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Infinity-MM
Multi-modal Instruction Dataset
Aquila-VL-2B
Shuhao Gu
Xiaomi
LLM, Vision-Language Model, AGI
Jialing Zhang
BAAI, BJTU
Siyuan Zhou
BAAI, BUPT
Kevin Yu
BAAI, ICT/CAS
Zhaohu Xing
Hong Kong University of Science and Technology (Guangzhou)
Medical Image Analysis, Video Understanding, Image Generation
Liangdong Wang
BAAI
Zhou Cao
BAAI
Jintao Jia
BAAI, ICT/CAS
Zhuoyi Zhang
BAAI, ICT/CAS
Yixuan Wang
BAAI, ICT/CAS
Zhenchong Hu
BAAI, ICT/CAS
Bo-Wen Zhang
BAAI
Jijie Li
BAAI
Dong Liang
BAAI
Yingli Zhao
BAAI
Yulong Ao
BAAI
Yaoqi Liu
ICT/CAS
Fangxiang Feng
Beijing University of Posts and Telecommunications
Multimodal Learning, Image Synthesis
Guang Liu
BAAI
AI, LLM, Data