🤖 AI Summary
Existing LLM evaluation methods suffer from limited task coverage, poor adaptability across modalities, and a lack of fine-grained consistency assessment. Method: The paper introduces FRAbench, the first large-scale, multimodal, fine-grained evaluation benchmark (60.4k samples, 325k aspect-level labels), together with GenEval, a general-purpose evaluator. The authors propose a 112-dimension hierarchical aspect taxonomy and a transferable fine-grained evaluation paradigm that spans tasks and modalities, supported by high-quality human-AI collaborative aspect-level annotation. Contribution/Results: GenEval combines unified multimodal modeling, LLM-as-a-judge fine-tuning, and meta-evaluation, significantly outperforming baselines on core dimensions such as logical coherence and factual consistency. It achieves high agreement with GPT-4o and human experts (Cohen's κ > 0.82), improves zero-shot cross-domain accuracy by 12.7%, and systematically exposes capability gaps in current multimodal foundation models.
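For context on the agreement figure above: Cohen's κ is the standard chance-corrected agreement statistic between two raters. With observed agreement $p_o$ and agreement expected by chance $p_e$,

$$\kappa = \frac{p_o - p_e}{1 - p_e},$$

so κ > 0.82 sits in the range conventionally read (e.g., on the Landis and Koch scale) as "almost perfect" agreement.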
📝 Abstract
Evaluating the open-ended outputs of large language models (LLMs) has become a bottleneck as model capabilities, task diversity, and modality coverage rapidly expand. Existing "LLM-as-a-Judge" evaluators are typically narrow, covering only a few tasks, aspects, or modalities, and often suffer from low consistency. In this paper, we argue that explicit, fine-grained aspect specification is the key to both generalizability and objectivity in automated evaluation. To this end, we introduce a hierarchical aspect taxonomy spanning 112 aspects that unifies evaluation across four representative settings: Natural Language Generation, Image Understanding, Image Generation, and Interleaved Text-and-Image Generation. Building on this taxonomy, we create FRAbench, a benchmark comprising 60.4k pairwise samples with 325k aspect-level labels obtained from a combination of human and LLM annotations. FRAbench provides the first large-scale, multimodal resource for training and meta-evaluating fine-grained LMM judges. Leveraging FRAbench, we develop GenEval, a fine-grained evaluator that generalizes across tasks and modalities. Experiments show that GenEval (i) attains high agreement with GPT-4o and expert annotators, (ii) transfers robustly to unseen tasks and modalities, and (iii) reveals systematic weaknesses of current LMMs in evaluation.
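To make the pairwise, aspect-level setup concrete, below is a minimal sketch of what one such record and an agreement check might look like. The field names and data are illustrative assumptions, not FRAbench's actual schema; the `cohens_kappa` helper simply implements the κ formula given above.

```python
# Illustrative sketch only: a hypothetical aspect-level pairwise judging record
# and a Cohen's kappa agreement check. Not FRAbench's actual data format.
from collections import Counter

sample = {
    "task": "image_understanding",
    "prompt": "Describe the chart in the image.",
    "response_a": "...",  # candidate output A
    "response_b": "...",  # candidate output B
    # one preference label (A / B / tie) per evaluated aspect
    "aspect_labels": {
        "factual_consistency": "A",
        "logical_coherence": "B",
        "fluency": "tie",
    },
}

def cohens_kappa(labels_1, labels_2):
    """Chance-corrected agreement between two raters over paired labels."""
    assert len(labels_1) == len(labels_2)
    n = len(labels_1)
    # observed agreement: fraction of items where the raters match
    p_o = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    c1, c2 = Counter(labels_1), Counter(labels_2)
    # expected agreement if both raters labeled independently at their marginals
    p_e = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy usage: agreement between a judge model and a human on aspect labels.
judge = ["A", "B", "tie", "A", "A", "B"]
human = ["A", "B", "A", "A", "A", "B"]
print(f"kappa = {cohens_kappa(judge, human):.2f}")
```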