🤖 AI Summary
Existing LLM evaluation methods suffer from limited task coverage, poor adaptability across modalities, and a lack of fine-grained consistency assessment. Method: The paper introduces FRAbench, the first large-scale, multimodal, fine-grained evaluation benchmark (60.4k samples, 325k aspect-level labels), together with GenEval, a general-purpose evaluator. The authors propose a 112-dimension hierarchical aspect taxonomy and a transferable fine-grained evaluation paradigm that spans tasks and modalities, supported by high-quality human-AI collaborative aspect-level annotation. Contribution/Results: GenEval combines unified multimodal modeling, LLM-as-a-judge fine-tuning, and meta-evaluation, significantly outperforming baselines on core dimensions such as logical coherence and factual consistency. It achieves high agreement with GPT-4o and human experts (Cohen's κ > 0.82), improves zero-shot cross-domain accuracy by 12.7%, and systematically exposes capability gaps in current multimodal foundation models.
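For context on the agreement figure above: Cohen's κ is the standard chance-corrected agreement statistic between two raters. With observed agreement $p_o$ and agreement expected by chance $p_e$,

$$\kappa = \frac{p_o - p_e}{1 - p_e},$$

so κ > 0.82 sits in the range conventionally read (e.g., on the Landis and Koch scale) as "almost perfect" agreement.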
📝 Abstract
Evaluating the open-ended outputs of large language models (LLMs) has become a bottleneck as model capabilities, task diversity, and modality coverage rapidly expand. Existing "LLM-as-a-Judge" evaluators are typically narrow, covering only a few tasks, aspects, or modalities, and often suffer from low consistency. In this paper, we argue that explicit, fine-grained aspect specification is the key to both generalizability and objectivity in automated evaluation. To this end, we introduce a hierarchical aspect taxonomy spanning 112 aspects that unifies evaluation across four representative settings: Natural Language Generation, Image Understanding, Image Generation, and Interleaved Text-and-Image Generation. Building on this taxonomy, we create FRAbench, a benchmark comprising 60.4k pairwise samples with 325k aspect-level labels obtained from a combination of human and LLM annotations. FRAbench provides the first large-scale, multimodal resource for training and meta-evaluating fine-grained LMM judges. Leveraging FRAbench, we develop GenEval, a fine-grained evaluator that generalizes across tasks and modalities. Experiments show that GenEval (i) attains high agreement with GPT-4o and expert annotators, (ii) transfers robustly to unseen tasks and modalities, and (iii) reveals systematic weaknesses of current LMMs in evaluation.
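To make the pairwise, aspect-level setup concrete, below is a minimal sketch of what one such record and an agreement check might look like. The field names and data are illustrative assumptions, not FRAbench's actual schema; the `cohens_kappa` helper simply implements the κ formula given above.

```python
# Illustrative sketch only: a hypothetical aspect-level pairwise judging record
# and a Cohen's kappa agreement check. Not FRAbench's actual data format.
from collections import Counter

sample = {
    "task": "image_understanding",
    "prompt": "Describe the chart in the image.",
    "response_a": "...",  # candidate output A
    "response_b": "...",  # candidate output B
    # one preference label (A / B / tie) per evaluated aspect
    "aspect_labels": {
        "factual_consistency": "A",
        "logical_coherence": "B",
        "fluency": "tie",
    },
}

def cohens_kappa(labels_1, labels_2):
    """Chance-corrected agreement between two raters over paired labels."""
    assert len(labels_1) == len(labels_2)
    n = len(labels_1)
    # observed agreement: fraction of items where the raters match
    p_o = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    c1, c2 = Counter(labels_1), Counter(labels_2)
    # expected agreement if both raters labeled independently at their marginals
    p_e = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy usage: agreement between a judge model and a human on aspect labels.
judge = ["A", "B", "tie", "A", "A", "B"]
human = ["A", "B", "A", "A", "A", "B"]
print(f"kappa = {cohens_kappa(judge, human):.2f}")
```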