Train a Unified Multimodal Data Quality Classifier with Synthetic Data

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current pretraining of multimodal large language models (MLLMs) lacks a unified, efficient approach for filtering low-quality image-text caption and interleaved document data. Method: This paper proposes UniFilter—the first unified classifier capable of jointly assessing the quality of both caption and interleaved document data. Leveraging a semi-synthetic strategy to generate multi-level quality-annotated samples, UniFilter integrates multimodal large models, quality-tiered prompt engineering, and supervised contrastive learning to enable fine-grained quality evaluation. Contribution/Results: Applied to DataComp and OBELICS, UniFilter identifies high-quality subsets; MLLMs pretrained on these subsets achieve superior zero-shot reasoning, in-context learning, and vision-language fine-tuning performance—outperforming baselines and attaining state-of-the-art results across multiple benchmarks.

📝 Abstract
Multimodal Large Language Models (MLLMs) are continually pre-trained on a mixture of image-text caption data and interleaved document data, yet high-quality data filtering for image-text interleaved document data remains under-explored. We propose training an efficient MLLM as a Unified Multimodal Data Quality Classifier to filter both high-quality image-text caption and interleaved data (UniFilter). To address the challenge of collecting diverse labeled multimodal data, we introduce a semi-synthetic approach that leverages readily available raw images and generates corresponding text across four quality levels. This method enables efficient creation of sample-score pairs for both caption and interleaved document data to train UniFilter. We apply UniFilter to curate high-quality caption data from the DataComp caption dataset and interleaved data from the OBELICS image-text interleaved dataset. MLLMs pre-trained on the filtered data demonstrate significantly enhanced capabilities compared to those trained on baseline-filtered data, achieving stronger zero-shot reasoning and in-context learning. After visual supervised fine-tuning, these UniFilter-induced MLLMs achieve stronger performance on various benchmarks, highlighting the downstream benefits of high-quality multimodal pre-training. We release the synthetic training data used for training UniFilter, the UniFilter model checkpoints, and the high-quality interleaved document subset OBELICS-HQ, curated by UniFilter, to the community for reproduction and further development.
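The semi-synthetic labeling idea described above—generating text at four quality levels for each raw image and pairing every (image, text) sample with its level as a supervision score—can be sketched as follows. This is a minimal illustration; the level names, score mapping, and `make_sample_score_pairs` helper are assumptions for exposition, not the paper's exact scheme, and the text generation step itself is out of scope here.

```python
# Illustrative four-level score mapping (assumed labels, not the paper's).
QUALITY_LEVELS = {
    "high": 3,     # accurate, detailed text
    "medium": 2,   # broadly correct but generic text
    "low": 1,      # vague or partially wrong text
    "garbage": 0,  # irrelevant or corrupted text
}

def make_sample_score_pairs(image_id, texts_by_level):
    """Pair each generated text with the score of its quality level.

    texts_by_level maps a quality-level name to text generated for the
    given image at that level; the result is classifier training data.
    """
    return [
        {"image": image_id, "text": text, "score": QUALITY_LEVELS[level]}
        for level, text in texts_by_level.items()
    ]

# One raw image yields one training sample per generated quality level.
pairs = make_sample_score_pairs(
    "img_001",
    {
        "high": "A golden retriever catching a red frisbee in a park.",
        "garbage": "buy cheap followers now click here",
    },
)
```

Training a classifier on such sample-score pairs is what lets a single model grade unseen caption or interleaved data without any human-labeled quality annotations.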
Problem

Research questions and friction points this paper is trying to address.

Addresses under-explored quality filtering for multimodal image-text interleaved document data
Trains unified multimodal classifier using synthetic data to overcome labeled data scarcity
Enhances MLLMs' reasoning and learning through curated high-quality multimodal pre-training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trains unified multimodal quality classifier with synthetic data
Generates text across quality levels using raw images
Filters high-quality caption and interleaved document data
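The filtering step in the last bullet can be sketched as a simple score-and-keep loop: the trained classifier assigns each sample a quality score, and only the top-scoring fraction survives into the pre-training corpus. The `quality_score` callable below is a hypothetical stand-in for the real UniFilter model, and the keep-fraction policy is an assumption for illustration.

```python
def filter_top_fraction(samples, quality_score, keep_fraction=0.5):
    """Keep the highest-scoring fraction of samples.

    quality_score maps a sample to a scalar quality estimate
    (in practice, a trained classifier's output).
    """
    scored = sorted(samples, key=quality_score, reverse=True)
    k = max(1, int(len(scored) * keep_fraction))
    return scored[:k]

# Toy stand-in scorer: longer captions score higher (purely illustrative).
samples = [
    "a dog",
    "a golden retriever catching a frisbee",
    "cat",
    "two children playing chess in a sunlit library",
]
kept = filter_top_fraction(samples, quality_score=len, keep_fraction=0.5)
```

In practice the same scoring interface applies to both caption pairs and interleaved documents, which is what makes a single unified classifier sufficient for both data types.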