Evaluating the Evaluators: Metrics for Compositional Text-to-Image Generation

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automated evaluation metrics for text-to-image generation are often adopted empirically, lacking systematic validation against human judgments—particularly for compositional alignment involving objects, attributes, and relational semantics. Method: We propose a multidimensional analytical framework and conduct the first unified benchmarking of three metric families—VQA-based, embedding-based, and image-only—using large-scale human annotations as the gold standard. Contribution/Results: No single metric family dominates across all dimensions; VQA-based metrics are not universally superior, and certain embedding-based metrics exhibit higher discriminative power for fine-grained relational alignment. Image-only metrics show limited capacity to model compositional alignment. Our findings reveal strong task dependency in metric behavior, providing empirical guidance and methodological insights for principled metric selection in text-to-image evaluation.
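To make the benchmarking setup concrete, below is a minimal sketch of the kind of metric-vs-human correlation analysis the summary describes: per-dimension Spearman correlation between automated metric scores and human ratings. The data layout, column names, and values are hypothetical illustrations, not the paper's actual benchmark format.

```python
# Minimal sketch: correlate automated metric scores with human alignment
# ratings, broken down by compositional dimension (objects vs. relations).
# All column names and values here are hypothetical.
import pandas as pd
from scipy.stats import spearmanr

# Each row: one (prompt, image) pair with a human rating and scores
# from two automated metrics.
df = pd.DataFrame({
    "dimension":  ["objects"] * 4 + ["relations"] * 4,
    "human":      [4, 2, 5, 1, 3, 5, 1, 2],
    "vqa_metric": [0.80, 0.45, 0.90, 0.30, 0.50, 0.70, 0.20, 0.40],
    "clip_score": [0.31, 0.27, 0.35, 0.22, 0.29, 0.26, 0.21, 0.33],
})

# Per-dimension rank correlation reveals the task dependency the paper
# reports: a metric can track humans well on one dimension and poorly
# on another.
for dim, group in df.groupby("dimension"):
    for metric in ("vqa_metric", "clip_score"):
        rho, _ = spearmanr(group["human"], group[metric])
        print(f"{dim:>10} | {metric:<10} | Spearman rho = {rho:+.2f}")
```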

📝 Abstract
Text-to-image generation has advanced rapidly, but assessing whether outputs truly capture the objects, attributes, and relations described in prompts remains a central challenge. Evaluation in this space relies heavily on automated metrics, yet these are often adopted by convention or popularity rather than validated against human judgment. Because evaluation and reported progress in the field depend directly on these metrics, it is critical to understand how well they reflect human preferences. To address this, we present a broad study of widely used metrics for compositional text-to-image evaluation. Our analysis goes beyond simple correlation, examining their behavior across diverse compositional challenges and comparing how different metric families align with human judgments. The results show that no single metric performs consistently across tasks: performance varies with the type of compositional problem. Notably, VQA-based metrics, though popular, are not uniformly superior, while certain embedding-based metrics prove stronger in specific cases. Image-only metrics, as expected, contribute little to compositional evaluation, as they are designed for perceptual quality rather than alignment. These findings underscore the importance of careful and transparent metric selection, both for trustworthy evaluation and for their use as reward models in generation. The project page is available at https://amirkasaei.com/eval-the-evals/.
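For reference, CLIPScore is the canonical example of the embedding-based family contrasted in the abstract. Below is a minimal CLIPScore-style sketch using the Hugging Face transformers CLIP API; the 2.5 * max(cos, 0) scaling follows the common CLIPScore convention (Hessel et al., 2021). This is a generic illustration, not the paper's exact implementation.

```python
# Minimal CLIPScore-style sketch of an embedding-based alignment metric.
# Generic illustration following the CLIPScore convention, not the
# paper's exact implementation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize the projected embeddings, then take cosine similarity.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cos = (img * txt).sum(dim=-1).item()
    return 2.5 * max(cos, 0.0)  # CLIPScore scaling (Hessel et al., 2021)

# Example with a hypothetical file:
# score = clip_score(Image.open("sample.png"), "a red cube on a blue sphere")
```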
Problem

Research questions and friction points this paper is trying to address.

Evaluating how well text-to-image models capture compositional prompts
Assessing automated metrics' alignment with human judgment (see the sketch after this list)
Analyzing metric performance across different types of compositional challenges
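One common way to quantify a metric's alignment with human judgment (a generic sketch, not necessarily the paper's exact protocol) is pairwise ranking accuracy: for two images generated from the same prompt, check whether the metric prefers the one humans rated higher. All values below are hypothetical.

```python
# Minimal sketch of pairwise ranking accuracy: the fraction of image pairs
# for which an automated metric agrees with the human preference.
# Tuples are (human_a, human_b, metric_a, metric_b); values are hypothetical.
pairs = [
    (5.0, 2.0, 0.81, 0.40),  # humans prefer a; metric agrees
    (1.0, 4.0, 0.55, 0.52),  # humans prefer b; metric disagrees
    (3.0, 4.5, 0.33, 0.61),  # humans prefer b; metric agrees
]

def pairwise_accuracy(pairs):
    # Keep only pairs where humans expressed a preference, then count
    # cases where the metric's preference points the same way.
    decided = [(ha - hb, ma - mb) for ha, hb, ma, mb in pairs if ha != hb]
    agree = sum(1 for dh, dm in decided if dh * dm > 0)
    return agree / len(decided)

print(f"pairwise ranking accuracy: {pairwise_accuracy(pairs):.2f}")  # 0.67
```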
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conducts a broad study of compositional text-to-image metrics
Analyzes metric performance across diverse compositional challenges
Compares metric families' alignment with human judgments