Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources

📅 2025-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address key bottlenecks in multimodal large language model (MLLM) pretraining—including the difficulty of high-quality data filtering, absence of effective multimodal mixing strategies, low sequence efficiency, and lack of open frameworks—this work achieves efficient, fully open pretraining of a 2B-parameter MLLM using only academic resources and just 442 A100-40G GPU-hours. We propose the first “fully open” standard, encompassing open-source training code, data filtering algorithms, and *all* pretraining and fine-tuning datasets. Technically, we introduce a low-to-high dynamic image resolution scaling scheme and a novel multimodal sequence packing strategy. Our pipeline integrates MLM-Filter with CLIP-based joint filtering, WebDataset format, and FSDP-based distributed training. On benchmarks including MMBench and SEEDBench, our model surpasses Qwen2-VL-2B despite consuming only 0.36% of its pretraining tokens (5B vs. 1.4T), significantly lowering computational and data curation barriers.
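The multimodal sequence packing mentioned above can be sketched as a greedy first-fit-decreasing bin-packing of variable-length image-text samples into fixed-length training sequences, which is what raises sequence efficiency by avoiding padding. This is a minimal illustrative sketch, not the paper's actual implementation; the function name and the 4096-token context length are assumptions.

```python
# Illustrative sketch of multimodal sequence packing via first-fit-decreasing
# bin packing. Each sample is represented only by its total token length
# (image tokens + text tokens); names and max_len are assumptions.

def pack_sequences(sample_lengths, max_len=4096):
    """Pack samples (given by token length) into bins of at most max_len tokens.

    Returns a list of bins, each a list of sample indices. Samples are
    placed longest-first into the first bin with room; a new bin is opened
    when none fits.
    """
    order = sorted(range(len(sample_lengths)), key=lambda i: -sample_lengths[i])
    bins, room = [], []  # parallel lists: sample indices per bin, space left per bin
    for i in order:
        length = sample_lengths[i]
        for b, r in enumerate(room):
            if length <= r:
                bins[b].append(i)
                room[b] -= length
                break
        else:  # no existing bin fits: open a new one
            bins.append([i])
            room.append(max_len - length)
    return bins
```

Packing five samples of lengths 3000, 2000, 1500, 1000, and 500 into a 4096-token context yields two full sequences instead of five padded ones, cutting wasted tokens accordingly.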

📝 Abstract
The reproduction of state-of-the-art multimodal LLM pre-training faces barriers at every stage of the pipeline, including high-quality data filtering, multimodal data mixture strategies, sequence packing techniques, and training frameworks. We introduce Open-Qwen2VL, a fully open-source 2B-parameter Multimodal Large Language Model pre-trained efficiently on 29M image-text pairs using only 442 A100-40G GPU hours. Our approach employs low-to-high dynamic image resolution and multimodal sequence packing to significantly enhance pre-training efficiency. The training dataset was carefully curated using both MLLM-based filtering techniques (e.g., MLM-Filter) and conventional CLIP-based filtering methods, substantially improving data quality and training efficiency. The Open-Qwen2VL pre-training is conducted on academic-level 8xA100-40G GPUs at UCSB on 5B packed multimodal tokens, which is 0.36% of the 1.4T multimodal pre-training tokens of Qwen2-VL. The final instruction-tuned Open-Qwen2VL outperforms the partially-open state-of-the-art MLLM Qwen2-VL-2B on various multimodal benchmarks of MMBench, SEEDBench, MMstar, and MathVista, indicating the remarkable training efficiency of Open-Qwen2VL. We open-source all aspects of our work, including compute-efficient and data-efficient training details, data filtering methods, sequence packing scripts, pre-training data in WebDataset format, FSDP-based training codebase, and both base and instruction-tuned model checkpoints. We redefine "fully open" for multimodal LLMs as the complete release of: 1) the training codebase, 2) detailed data filtering techniques, and 3) all pre-training and supervised fine-tuning data used to develop the model.
Problem

Research questions and friction points this paper is trying to address.

Efficient pre-training of multimodal LLMs on limited academic resources
Overcoming barriers in high-quality multimodal data filtering and processing
Achieving state-of-the-art performance with fully open-source model and data
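The joint MLLM- and CLIP-based data filtering named above amounts to keeping an image-text pair only if it passes both a CLIP image-text similarity threshold and an MLM-Filter quality-score threshold. The sketch below is illustrative only; the function name and both threshold values are assumptions, not the paper's settings.

```python
# Illustrative joint filter: a pair survives only if it clears both the
# CLIP similarity cutoff and the MLM-Filter quality-score cutoff.
# Threshold values here are assumptions, not the paper's configuration.

def keep_pair(clip_score, mlm_score, clip_thresh=0.28, mlm_thresh=85):
    """Return True if an image-text pair passes both quality filters."""
    return clip_score >= clip_thresh and mlm_score >= mlm_thresh
```

Requiring both signals to pass discards pairs that one filter alone would admit, which is how joint filtering raises data quality over either method used in isolation.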
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic image resolution enhances training efficiency
Multimodal sequence packing optimizes token usage
MLLM and CLIP filtering improve data quality
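The low-to-high dynamic image resolution idea can be sketched as a step-dependent schedule: most of pre-training runs at a cheap low resolution, with a switch to high resolution only near the end. The concrete resolutions, switch point, and function name below are illustrative assumptions, not the paper's values.

```python
# Illustrative low-to-high resolution schedule: train at low resolution
# for most steps, then switch to high resolution for the final fraction
# of training. All constants here are assumptions.

def resolution_for_step(step, total_steps, low=224, high=448, high_frac=0.25):
    """Return the image side length (pixels) to use at a training step.

    The final `high_frac` fraction of steps runs at `high` resolution;
    everything before runs at `low`, saving image-token compute on the
    bulk of training.
    """
    return high if step >= total_steps * (1 - high_frac) else low
```

Because image token count grows roughly quadratically with side length, spending most steps at the low resolution is where the compute savings come from.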