FRAbench and GenEval: Scaling Fine-Grained Aspect Evaluation across Tasks, Modalities

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM evaluation methods suffer from limited task coverage, poor modality adaptability, and insufficient fine-grained consistency assessment. Method: This paper introduces FRAbench, the first large-scale, multimodal, fine-grained evaluation benchmark (60.4k samples, 325k aspect-level labels), together with GenEval, a general-purpose evaluator. The authors propose a 112-dimension hierarchical aspect taxonomy and a transferable, cross-task, cross-modal fine-grained evaluation paradigm, supported by high-quality human-AI collaborative aspect-level annotation. Contribution/Results: GenEval combines unified multimodal modeling, LLM-as-a-judge fine-tuning, and meta-evaluation, significantly outperforming baselines on core dimensions such as logical coherence and factual consistency. It achieves high agreement with GPT-4o and human experts (Cohen's κ > 0.82), improves zero-shot cross-domain accuracy by 12.7%, and systematically exposes critical capability gaps in current multimodal foundation models.

📝 Abstract
Evaluating the open-ended outputs of large language models (LLMs) has become a bottleneck as model capabilities, task diversity, and modality coverage rapidly expand. Existing "LLM-as-a-Judge" evaluators are typically narrow, covering only a few tasks, aspects, or modalities, and often suffer from low consistency. In this paper, we argue that explicit, fine-grained aspect specification is the key to both generalizability and objectivity in automated evaluation. To this end, we introduce a hierarchical aspect taxonomy spanning 112 aspects that unifies evaluation across four representative settings: Natural Language Generation, Image Understanding, Image Generation, and Interleaved Text-and-Image Generation. Building on this taxonomy, we create FRAbench, a benchmark comprising 60.4k pairwise samples with 325k aspect-level labels obtained from a combination of human and LLM annotations. FRAbench provides the first large-scale, multi-modal resource for training and meta-evaluating fine-grained LMM judges. Leveraging FRAbench, we develop GenEval, a fine-grained evaluator generalizable across tasks and modalities. Experiments show that GenEval (i) attains high agreement with GPT-4o and expert annotators, (ii) transfers robustly to unseen tasks and modalities, and (iii) reveals systematic weaknesses of current LMMs on evaluation.
Problem

Research questions and friction points this paper is trying to address.

Evaluation of diverse LLM outputs lacks scalable, consistent methods
Current evaluators are narrow in tasks, aspects, or modalities
Need unified fine-grained aspect taxonomy for multi-modal evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical aspect taxonomy spanning 112 aspects
FRAbench benchmark with 60.4k pairwise samples
GenEval evaluator generalizable across tasks
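The paper's meta-evaluation reports judge-vs-human agreement as Cohen's κ over pairwise preference labels. As a minimal sketch of that metric (the label data below is invented for illustration and is not from FRAbench):

```python
# Sketch: Cohen's kappa between two annotators over the same pairwise
# preference labels ("A" or "B" = which of two model outputs wins on an aspect).
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's label marginals.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical human vs. judge preferences on eight aspect-level comparisons.
human = ["A", "B", "A", "A", "B", "A", "B", "A"]
judge = ["A", "B", "A", "B", "B", "A", "B", "A"]
print(round(cohens_kappa(human, judge), 3))  # → 0.75
```

A κ above 0.8, as reported for GenEval against GPT-4o and expert annotators, is conventionally read as near-perfect agreement.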
Authors
Shibo Hong
School of Computer Science, Fudan University
Jiahao Ying
School of Computer Science, Singapore Management University
Haiyuan Liang
School of Computer Science, Fudan University
Mengdi Zhang
Meituan
Jun Kuang
Meituan-M17
Jiazheng Zhang
Fudan University
Yixin Cao
School of Computer Science, Fudan University