What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Coverage of MLLMs

📅 2025-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visual captioning benchmarks are outdated in the MLLM era: they evaluate only short captions, rely on obsolete metrics, and lack a systematic characterization of correctness and coverage. To address this, we propose CV-CapBench, the first comprehensive benchmark for assessing visual description quality in MLLMs. It models captioning capability across six views and thirteen fine-grained dimensions, introducing a triadic metric system (precision, recall, and hit rate) and the first fine-grained decomposition of dynamic scene understanding and knowledge-intensive description. The methodology combines structured annotation via visual element decomposition, a hybrid evaluation pipeline pairing model-assisted scoring with human verification, and an interpretable, attributable multi-dimensional scoring mechanism. Experiments reveal significant capability gaps in mainstream MLLMs, particularly in temporal action reasoning, cross-entity relational understanding, and commonsense inference. The code and annotated data will be released.
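
As a rough illustration of the hybrid evaluation pipeline described above, the sketch below decomposes a caption into elements, scores each with a judge model, and routes low-confidence verdicts to human verification. Everything here is an assumption for illustration: the `decompose` splitter, the stubbed `model_judge`, and the confidence-based routing are hypothetical stand-ins, not the paper's pipeline.

```python
import random

def decompose(caption: str) -> list[str]:
    # Hypothetical element decomposition: a real pipeline would use an LLM
    # to split the caption into atomic visual claims; here we split on ".".
    return [s.strip() for s in caption.split(".") if s.strip()]

def model_judge(element: str, ground_truth: set[str]) -> tuple[bool, float]:
    # Stub judge model: verdict by exact match, with a placeholder confidence.
    return element in ground_truth, random.uniform(0.5, 1.0)

def evaluate(caption: str, ground_truth: set[str], human_review,
             threshold: float = 0.8) -> dict[str, bool]:
    # Model-assisted scoring, with human verification of low-confidence verdicts.
    results = {}
    for element in decompose(caption):
        verdict, confidence = model_judge(element, ground_truth)
        if confidence < threshold:
            verdict = human_review(element, ground_truth)  # human overrides
        results[element] = verdict
    return results

if __name__ == "__main__":
    gt = {"a dog runs across the lawn", "the sky is overcast"}
    caption = "A dog runs across the lawn. The sky is blue."
    # Trivial human reviewer for the demo: same exact-match rule.
    print(evaluate(caption.lower(), gt, lambda e, g: e in g))
```

The routing pattern is the point of the sketch: cheap model judgments cover most elements, and human effort is spent only where the judge is unsure.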

📝 Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have rendered traditional visual captioning benchmarks obsolete, as they primarily evaluate short descriptions with outdated metrics. While recent benchmarks address these limitations by decomposing captions into visual elements and adopting model-based evaluation, they remain incomplete, overlooking critical aspects while providing vague, non-explanatory scores. To bridge this gap, we propose CV-CapBench, a Comprehensive Visual Caption Benchmark that systematically evaluates caption quality across 6 views and 13 dimensions. CV-CapBench introduces precision, recall, and hit rate metrics for each dimension, uniquely assessing both correctness and coverage. Experiments on leading MLLMs reveal significant capability gaps, particularly in dynamic and knowledge-intensive dimensions. These findings provide actionable insights for future research. The code and data will be released.
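
A minimal sketch of how the per-dimension triadic metrics could be computed, assuming each caption has already been decomposed into visual elements and matched against annotated ground truth. The `DimensionJudgment` structure, the micro-averaged aggregation, and the hit-rate definition (at least one ground-truth element of the dimension covered) are assumptions for illustration, not the benchmark's released scoring code.

```python
from dataclasses import dataclass

@dataclass
class DimensionJudgment:
    """Match results for one sample in one dimension (hypothetical structure)."""
    matched_pred: int  # predicted elements judged correct
    total_pred: int    # all elements the caption asserted for this dimension
    matched_gt: int    # ground-truth elements the caption covered
    total_gt: int      # all annotated ground-truth elements

def triadic_metrics(samples: list[DimensionJudgment]) -> dict[str, float]:
    """Micro-averaged precision (correctness), recall (coverage), and hit
    rate for a single dimension over a set of samples."""
    n_pred = sum(s.total_pred for s in samples)
    n_gt = sum(s.total_gt for s in samples)
    # Hit rate: fraction of samples where the caption covers at least one
    # ground-truth element of this dimension (an assumed definition).
    hits = sum(1 for s in samples if s.matched_gt > 0)
    return {
        "precision": sum(s.matched_pred for s in samples) / n_pred if n_pred else 0.0,
        "recall": sum(s.matched_gt for s in samples) / n_gt if n_gt else 0.0,
        "hit_rate": hits / len(samples) if samples else 0.0,
    }

if __name__ == "__main__":
    # Two hypothetical samples scored on one dimension (e.g., action/temporal).
    demo = [
        DimensionJudgment(matched_pred=3, total_pred=4, matched_gt=3, total_gt=5),
        DimensionJudgment(matched_pred=1, total_pred=2, matched_gt=0, total_gt=3),
    ]
    print(triadic_metrics(demo))  # precision 4/6, recall 3/8, hit_rate 1/2
```

Micro-averaging weights element-rich samples more heavily; averaging per-sample ratios instead would weight all samples equally. Either aggregation is plausible here, and the paper's exact choice is not specified in this summary.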
Problem

Research questions and friction points this paper is trying to address.

Traditional captioning benchmarks evaluate only short descriptions with outdated metrics
Newer element-based benchmarks still overlook critical caption aspects
Existing scores are vague and non-explanatory, leaving correctness and coverage unassessed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CV-CapBench for caption evaluation
Assesses correctness and coverage systematically
Uses precision, recall, and hit rate metrics