VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing automatic evaluation metrics are designed for short image captions and thus fail to accurately assess the long, detailed descriptions generated by multimodal large language models (MLLMs). Moreover, mainstream LLM-as-a-Judge approaches suffer from slow inference due to autoregressive decoding and early fusion of visual information. To address these limitations, we propose VELA, an efficient framework for evaluating long image descriptions that introduces the first LLM-Hybrid-as-a-Judge paradigm: visual fusion is delayed, reasoning proceeds in multiple stages, and scores are aligned with multi-perspective human annotations across three dimensions (Descriptiveness, Relevance, and Fluency). We further introduce LongCap-Arena, the first benchmark dedicated to long-caption evaluation, comprising 7,805 images and 32,246 high-quality human judgments. Experiments demonstrate that VELA significantly outperforms existing metrics on LongCap-Arena, achieving strong agreement with human judgments (Spearman’s ρ > 0.92) and, for the first time, surpassing human-level performance in automatic long-description evaluation.
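The reported agreement is a rank correlation between metric scores and human judgments. Below is a minimal sketch of how such a meta-evaluation is typically computed with scipy.stats.spearmanr; the score arrays are illustrative placeholders, not data from the paper.

```python
# Meta-evaluation sketch: correlate an automatic metric's scores with human
# judgments via Spearman's rank correlation. The arrays are illustrative
# placeholders, not data from the paper.
from scipy.stats import spearmanr

# Hypothetical per-caption scores from an automatic metric ...
metric_scores = [0.91, 0.42, 0.77, 0.15, 0.88, 0.60]
# ... and the corresponding human ratings for the same captions,
# e.g., on one perspective such as Relevance.
human_ratings = [5, 2, 4, 1, 5, 3]

rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3g})")
```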

📝 Abstract
In this study, we focus on the automatic evaluation of long and detailed image captions generated by multimodal Large Language Models (MLLMs). Most existing automatic evaluation metrics for image captioning are primarily designed for short captions and are not suitable for evaluating long captions. Moreover, recent LLM-as-a-Judge approaches suffer from slow inference due to their reliance on autoregressive decoding and early fusion of visual information. To address these limitations, we propose VELA, an automatic evaluation metric for long captions developed within a novel LLM-Hybrid-as-a-Judge framework. Furthermore, we propose LongCap-Arena, a benchmark specifically designed for evaluating metrics for long captions. This benchmark comprises 7,805 images, the corresponding human-provided long reference captions and long candidate captions, and 32,246 human judgments from three distinct perspectives: Descriptiveness, Relevance, and Fluency. We demonstrated that VELA outperformed existing metrics and achieved superhuman performance on LongCap-Arena.
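From the abstract, each LongCap-Arena item pairs an image with a human-provided long reference caption, a long candidate caption, and human judgments from three perspectives. A hypothetical record layout is sketched below; the field names and rating scale are assumptions, not the released schema.

```python
# Hypothetical LongCap-Arena record, inferred from the abstract. Field names
# and the rating scale are assumptions, not the benchmark's released format.
from dataclasses import dataclass

@dataclass
class LongCapArenaItem:
    image_path: str          # source image
    reference_caption: str   # human-provided long reference caption
    candidate_caption: str   # long candidate caption under evaluation
    descriptiveness: float   # human judgment: level of detail
    relevance: float         # human judgment: faithfulness to the image
    fluency: float           # human judgment: linguistic quality

item = LongCapArenaItem(
    image_path="images/000001.jpg",
    reference_caption="A long, detailed description of the scene ...",
    candidate_caption="An MLLM-generated detailed description ...",
    descriptiveness=4.0,
    relevance=5.0,
    fluency=4.5,
)
```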
Problem

Research questions and friction points this paper is trying to address.

Evaluating long, detailed image captions generated by multimodal LLMs
Overcoming the limitations of short-caption metrics and slow LLM-as-a-Judge inference
Building a benchmark and metric for comprehensive long-caption assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid LLM framework for caption evaluation
Late-fusion technique to accelerate inference (see the sketch after this list)
Specialized benchmark with multi-perspective human judgments
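This page does not detail VELA's actual architecture, but the late-fusion idea can be illustrated abstractly: score the candidate caption against the reference with a text-only judge first, and fold in visual evidence only at a final combination step, rather than feeding the image into the LLM from the start. Everything in the sketch below (the token-overlap stand-in for an LLM judge, the similarity placeholder, the weighting) is a hypothetical placeholder, not VELA's method.

```python
# Illustrative late-fusion judging pipeline; NOT the paper's VELA
# architecture. All functions and the weighting are hypothetical stand-ins.

def text_judge_score(candidate: str, reference: str) -> float:
    """Placeholder text-only judge: crude token overlap standing in for an
    LLM-based comparison of candidate vs. reference captions."""
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    return len(cand & ref) / max(len(cand | ref), 1)

def image_text_similarity(image_path: str, caption: str) -> float:
    """Placeholder for a separately computed image-text similarity
    (e.g., a CLIP-style score); returns a dummy constant here."""
    return 0.8

def late_fusion_score(image_path: str, candidate: str, reference: str,
                      alpha: float = 0.5) -> float:
    # Late fusion: visual information enters only at this final step, so the
    # text-side judging never has to condition on image tokens.
    s_text = text_judge_score(candidate, reference)
    s_visual = image_text_similarity(image_path, candidate)
    return alpha * s_text + (1 - alpha) * s_visual

print(late_fusion_score("img.jpg",
                        "a dog runs across a grassy park",
                        "a brown dog running through a green park"))
```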
Kazuki Matsuda
Keio University
Yuiga Wada
Ph.D. Student, Keio University
Machine Learning
Shinnosuke Hirano
Keio University
Seitaro Otsuki
Keio University
Komei Sugiura
Professor, Keio University
Multimodal AI, Robot Learning, Embodied AI, Machine Learning