FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating AI-generated videos faces two challenges: detecting fine-grained artifacts and producing interpretable, quantitative scores. This paper proposes FingER, an entity-level fine-grained reasoning evaluation framework: first, an LLM automatically generates entity-oriented questions covering five analytical dimensions; then, a multimodal LLM performs stepwise reasoning to assign per-question scores, which are aggregated via learned weights into an overall score. Key contributions include: (1) the first entity-level, explanation-driven QA-based evaluation paradigm; (2) a large-scale fine-grained video QA dataset with 3.3K videos and 60K rationale-annotated QA pairs; and (3) a cold-start GRPO training strategy to enhance reasoning accuracy. On GenAI-Bench and MonetBench, FingER outperforms state-of-the-art methods by relative margins of 11.8% and 5.5%, respectively, using only 3.3K training samples, at most one-tenth of what competing approaches use, achieving high efficiency, accuracy, and interpretability in video assessment.

📝 Abstract
Recent advances in video generation have posed great challenges for the assessment of AI-generated content, particularly with the emergence of increasingly sophisticated models. The various inconsistencies and defects observed in such videos are inherently complex, making overall scoring notoriously difficult. In this paper, we emphasize the critical importance of integrating fine-grained reasoning into video evaluation, and we propose FingER, a novel entity-level reasoning evaluation framework that first automatically generates Fine-grained Entity-level questions, and then answers those questions with a Reasoning model that assigns scores, which can subsequently be combined via a weighted sum into an overall score for different applications. Specifically, we leverage LLMs to derive entity-level questions across five distinct perspectives, which (i) often focus on specific entities of the content, thereby making answering or scoring much easier for MLLMs, and (ii) are more interpretable. We then construct a FingER dataset, consisting of approximately 3.3k videos and 60k corresponding fine-grained QA annotations, each with detailed reasons. Based on that, we further investigate various training protocols to best incentivize the reasoning capability of MLLMs for correct answer prediction. Extensive experiments demonstrate that a reasoning model trained using Group Relative Policy Optimization (GRPO) with a cold-start strategy achieves the best performance. Notably, our model surpasses existing methods by a relative margin of 11.8% on GenAI-Bench and 5.5% on MonetBench with only 3.3k training videos, at most one-tenth of the training samples utilized by other methods. Our code and dataset will be released soon.
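The aggregation step described above (per-question scores combined by a weighted sum into one overall score) can be sketched as follows. This is a minimal illustration only: the dimension names, the per-dimension averaging, and the uniform weights are assumptions for the example, not the paper's actual learned weights or taxonomy.

```python
# Hypothetical sketch of FingER-style score aggregation: the reasoning
# model scores each entity-level question, and per-dimension means are
# combined with weights into one overall video score.
# The five dimension names below are placeholders, not the paper's.
DIMENSIONS = ["appearance", "motion", "interaction", "physics", "semantics"]

def aggregate_score(question_scores: dict, weights: dict) -> float:
    """Weighted sum of mean per-dimension question scores."""
    total = 0.0
    for dim in DIMENSIONS:
        scores = question_scores.get(dim, [])
        # Average the scores of all questions in this dimension.
        dim_mean = sum(scores) / len(scores) if scores else 0.0
        total += weights[dim] * dim_mean
    return total

# Example with uniform (illustrative) weights over the five perspectives.
w = {d: 0.2 for d in DIMENSIONS}
qs = {d: [4.0, 5.0] for d in DIMENSIONS}
print(aggregate_score(qs, w))  # with uniform weights, close to 4.5
```

In practice the paper learns the weights, so different applications (e.g. prioritizing motion consistency over semantics) can reuse the same per-question scores with different weightings.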
Problem

Research questions and friction points this paper is trying to address.

Assessing inconsistencies in AI-generated videos effectively
Developing fine-grained entity-level evaluation with reasoning
Improving interpretability and scoring accuracy for video content
Innovation

Methods, ideas, or system contributions that make the work stand out.

Entity-level reasoning framework for video evaluation
LLMs generate fine-grained entity-level questions
GRPO training with a cold start enhances MLLM reasoning performance
Rui Chen
AMAP, Alibaba Group
Lei Sun
AMAP, Alibaba Group
Jing Tang
AMAP, Alibaba Group
Geng Li
Peking University
Xiangxiang Chu
AMAP, Alibaba Group