Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training

📅 2025-04-17

📈 Citations: 0

✨ Influential: 0

career value

169K/year

🤖 AI Summary

High-quality image-text pairs for vision-language (VL) large model pretraining are scarce and difficult to scale. Method: This paper proposes a low-hallucination synthetic caption generation paradigm, introducing a controllable large language model–based captioning framework with multi-stage knowledge injection and a continuous direct preference optimization (DPO) pipeline, significantly reducing hallucination (non-hallucination rate improves from 48.2% to 77.9%). Contribution/Results: Empirical evaluation—first of its kind—demonstrates that low-hallucination synthetic captions yield average performance gains ≥6.2% across 35 VL benchmarks, outperforming real weak-supervision alternatives such as alt-text. They also enhance text-to-image generation quality, lowering FID by 17.1 on a real-world validation set and by 13.3 on MSCOCO. To support scalable, high-fidelity VL pretraining, we release Hunyuan-Recap100M, a large-scale synthetic dataset.

Technology Category

Application Category

📝 Abstract

In recent years, the field of vision-language model pre-training has experienced rapid advancements, driven primarily by the continuous enhancement of textual capabilities in large language models. However, existing training paradigms for multimodal large language models heavily rely on high-quality image-text pairs. As models and data scales grow exponentially, the availability of such meticulously curated data has become increasingly scarce and saturated, thereby severely limiting further advancements in this domain. This study investigates scalable caption generation techniques for vision-language model pre-training and demonstrates that large-scale low-hallucination synthetic captions can serve dual purposes: 1) acting as a viable alternative to real-world data for pre-training paradigms and 2) achieving superior performance enhancement when integrated into vision-language models through empirical validation. This paper presents three key contributions: 1) a novel pipeline for generating high-quality, low-hallucination, and knowledge-rich synthetic captions. Our continuous DPO methodology yields remarkable results in reducing hallucinations. Specifically, the non-hallucination caption rate on a held-out test set increases from 48.2% to 77.9% for a 7B-size model. 2) Comprehensive empirical validation reveals that our synthetic captions confer superior pre-training advantages over their counterparts. Across 35 vision language tasks, the model trained with our data achieves a significant performance gain of at least 6.2% compared to alt-text pairs and other previous work. Meanwhile, it also offers considerable support in the text-to-image domain. With our dataset, the FID score is reduced by 17.1 on a real-world validation benchmark and 13.3 on the MSCOCO validation benchmark. 3) We will release Hunyuan-Recap100M, a low-hallucination and knowledge-intensive synthetic caption dataset.

Problem

Research questions and friction points this paper is trying to address.

Generating low-hallucination synthetic captions for vision-language models

Reducing reliance on scarce high-quality image-text pairs

Enhancing performance in vision-language and text-to-image tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates low-hallucination synthetic captions

Uses continuous DPO methodology

Releases Hunyuan-Recap100M dataset

🔎 Similar Papers

SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models