MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites

📅 2025-10-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current open-source visual captioning models significantly underperform commercial counterparts (e.g., GPT-4.1), which hinders downstream applications such as synthetic data generation. To address this, the authors propose CapFlow, a multi-agent collaborative data synthesis framework that, for the first time, constructs high-quality image and video caption datasets using exclusively open-source models while reducing data generation costs by 89.5%. Using this dataset, they train MetaCaptioner, a general-purpose visual captioner capable of cross-modal and cross-domain understanding. Extensive experiments show that MetaCaptioner performs on par with GPT-4.1 across standard benchmarks, including MSCOCO and VideoCapsule, setting a new state of the art among open-source models and substantially narrowing the quality gap between open- and closed-source visual captioning systems.

📝 Abstract

Generalist visual captioning goes beyond simple appearance description: it requires integrating a series of visual cues into a caption and handling various visual domains. In this task, current open-source models show a large performance gap relative to commercial ones, which limits applications such as data synthesis. To bridge the gap, this paper proposes CapFlow, a novel multi-agent collaboration workflow. CapFlow demonstrates for the first time that, by capitalizing on open-source models, it is possible to achieve caption quality on par with GPT-4.1 in various domains with an 89.5% reduction in costs. By leveraging CapFlow as the data synthesizer, we produce high-quality visual captions from image and video domains at scale, and obtain a generalist visual captioner via fine-tuning, namely MetaCaptioner. Through extensive experiments, we show that MetaCaptioner not only achieves captioning capabilities comparable to commercial models but also reaches top-tier multimodal performance in the open-source community. We hope CapFlow and MetaCaptioner can benefit future multimodal research by providing a strong and cost-effective visual captioning solution.
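The abstract does not spell out how CapFlow's agents are organized, so the following is only an illustrative sketch of the general idea of a multi-agent captioning workflow: several specialized "agents" each extract one kind of visual cue, and a merge step integrates the cues into a single caption. Every name here (`CaptionWorkflow`, `describe_objects`, `read_text`, `describe_scene`, `simple_merge`) is hypothetical, and the agents are stubs standing in for open-source model calls; none of this is taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical type: an agent maps an image identifier to one textual visual cue.
Agent = Callable[[str], str]


@dataclass
class CaptionWorkflow:
    """Toy multi-agent workflow (assumed structure, not the paper's)."""
    agents: Dict[str, Agent]                 # named cue extractors
    merge: Callable[[Dict[str, str]], str]   # integrates cues into one caption

    def run(self, image_id: str) -> str:
        # Ask every agent for its cue, then drop agents that found nothing.
        cues = {name: agent(image_id) for name, agent in self.agents.items()}
        cues = {name: cue for name, cue in cues.items() if cue.strip()}
        return self.merge(cues)


def simple_merge(cues: Dict[str, str]) -> str:
    # Join cues as sentences, normalizing the trailing period.
    return " ".join(f"{cue.rstrip('.')}." for cue in cues.values())


# Stub agents; in practice each would call an open-source model.
def describe_objects(image_id: str) -> str:
    return "A dog sits on a bench"


def read_text(image_id: str) -> str:
    return ""  # no legible text found in this hypothetical image


def describe_scene(image_id: str) -> str:
    return "The setting is a park at sunset."


workflow = CaptionWorkflow(
    agents={
        "objects": describe_objects,
        "text": read_text,
        "scene": describe_scene,
    },
    merge=simple_merge,
)
caption = workflow.run("img_001")
print(caption)
```

In a real system the merge step would likely be a language model rather than string concatenation; the point of the sketch is only the separation between cue extraction and cue integration.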

Problem

Research questions and friction points this paper is trying to address.

Bridging the performance gap between open-source and commercial captioning models

Reducing costs while maintaining high-quality visual caption generation

Creating a generalist captioner that handles various visual domains effectively

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent workflow boosts caption quality

Open-source models match GPT-4.1 caption quality

Cost-effective data synthesis enables a generalist captioner
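To make the reported 89.5% cost reduction concrete, here is a small arithmetic sketch. It assumes the reduction is measured relative to a GPT-4.1-based captioning pipeline, as the abstract suggests; the baseline budget of 1000 is a made-up figure used only for illustration, and the function names are hypothetical.

```python
def cost_after_reduction(baseline_cost: float, reduction: float = 0.895) -> float:
    """Cost remaining after a fractional reduction of the baseline cost."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return baseline_cost * (1.0 - reduction)


def cost_ratio(reduction: float = 0.895) -> float:
    """How many times more expensive the baseline is than the reduced cost."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# Hypothetical baseline budget of 1000 (arbitrary currency units).
remaining = cost_after_reduction(1000.0)
ratio = cost_ratio()
print(f"remaining cost: {remaining:.1f}")
print(f"baseline is about {ratio:.1f}x more expensive")
```

Under these assumptions, an 89.5% reduction leaves roughly 10.5% of the baseline cost, which means the baseline is about 9.5 times more expensive for the same volume of captions.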

🔎 Similar Papers

Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy, Trends and Metrics Analysis