GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image (T2I) automatic evaluation suffers from "benchmark drift": static benchmarks such as GenEval diverge from human judgment as models improve, with GenEval now showing an absolute error of up to 17.7% and clear saturation. Method: This work systematically documents the drift and introduces GenEval 2, a next-generation benchmark, together with Soft-TIFA, an evaluation method based on soft primitive decomposition. GenEval 2 broadens coverage of visual primitives and adds more compositional, more challenging prompts; Soft-TIFA provides fine-grained primitive-level scoring validated against human judgment along multiple dimensions. Contribution/Results: GenEval 2 avoids saturation, and Soft-TIFA substantially improves agreement with human judgment (outperforming VQAScore) while remaining more robust as models advance. Together they mitigate the risk of benchmark drift and establish a more adaptive, human-aligned evaluation paradigm for evolving T2I systems.

📝 Abstract
Automating Text-to-Image (T2I) model evaluation is challenging; a judge model must be used to score correctness, and test prompts must be selected to be challenging for current T2I models but not the judge. We argue that satisfying these constraints can lead to benchmark drift over time, where the static benchmark judges fail to keep up with newer model capabilities. We show that benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks. Although GenEval was well-aligned with human judgment at the time of its release, it has drifted far from human judgment over time -- resulting in an absolute error of as much as 17.7% for current models. This level of drift strongly suggests that GenEval has been saturated for some time, as we verify via a large-scale human study. To help fill this benchmarking gap, we introduce a new benchmark, GenEval 2, with improved coverage of primitive visual concepts and higher degrees of compositionality, which we show is more challenging for current models. We also introduce Soft-TIFA, an evaluation method for GenEval 2 that combines judgments for visual primitives, which we show is more well-aligned with human judgment and argue is less likely to drift from human-alignment over time (as compared to more holistic judges such as VQAScore). Although we hope GenEval 2 will provide a strong benchmark for many years, avoiding benchmark drift is far from guaranteed and our work, more generally, highlights the importance of continual audits and improvement for T2I and related automated model evaluation benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Automated T2I judges must score correctness on prompts that challenge current generators but not the judge itself.
Static benchmarks drift from human judgment as models improve; GenEval now shows up to 17.7% absolute error on current models.
GenEval appears to have been saturated for some time, as confirmed by a large-scale human study.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces GenEval 2, a benchmark with broader coverage of primitive visual concepts and higher compositionality.
Proposes Soft-TIFA, which combines per-primitive judgments and aligns better with human judgment than holistic judges such as VQAScore.
Argues for continual audits and improvement of automated T2I evaluation benchmarks to prevent drift.
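To make the primitive-decomposition idea concrete, here is a minimal sketch of how per-primitive soft scores might be aggregated into an image-level score. The primitive names, the example scores, and the mean aggregation are illustrative assumptions, not the paper's exact Soft-TIFA formulation.

```python
# Hypothetical sketch of primitive-level soft scoring in the spirit of
# Soft-TIFA: each visual primitive in a prompt (object, color, count,
# spatial relation, ...) receives a soft correctness score in [0, 1]
# (e.g., a VQA judge's yes-probability), and the image-level score
# aggregates them. Mean aggregation is an assumption for illustration.

def soft_primitive_score(primitive_scores: dict[str, float]) -> float:
    """Aggregate per-primitive soft scores into one image-level score."""
    if not primitive_scores:
        raise ValueError("need at least one primitive score")
    for name, s in primitive_scores.items():
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score for {name!r} must lie in [0, 1]")
    return sum(primitive_scores.values()) / len(primitive_scores)

# Example: prompt "a red cup to the left of a blue book"
scores = {
    "object:cup": 0.97,
    "object:book": 0.92,
    "color:cup=red": 0.81,
    "color:book=blue": 0.88,
    "relation:left_of": 0.45,  # spatial relations are often hardest
}
print(round(soft_primitive_score(scores), 3))  # → 0.806
```

Compared with a single holistic judgment, this decomposition exposes which primitive failed (here the spatial relation), which is what enables the fine-grained, per-primitive human-alignment checks the summary describes.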