đ¤ AI Summary
Existing automatic evaluation methods for text-to-image alignment predominantly prioritize correlation with human judgments while neglecting foundational trustworthiness attributesâconsistency and robustnessâessential for reliable assessment.
Method: The authors formally define and empirically validate these two trustworthiness properties through systematic, controlled experiments across diverse diffusion models (e.g., Stable Diffusion, SDXL, DALL¡E 3) and alignment metrics (e.g., CLIPScore, TIFA, Pick-a-Pic), complemented by attribution analysis.
Contribution/Results: All 12 mainstream evaluation methods violate at least one trustworthiness property. To address this, the authors propose a reproducible and scalable framework for evaluation improvementâalready adopted by three top-tier conference papersâthereby shifting the paradigm of textâimage alignment evaluation from âcorrelation-orientedâ to âtrustworthiness-oriented.â
đ Abstract
Text-to-image models often struggle to generate images that precisely match textual prompts. Prior research has extensively studied the evaluation of image-text alignment in text-to-image generation. However, existing evaluations primarily focus on agreement with human assessments, neglecting other critical properties of a trustworthy evaluation framework. In this work, we first identify two key aspects that a reliable evaluation should address. We then empirically demonstrate that current mainstream evaluation frameworks fail to fully satisfy these properties across a diverse range of metrics and models. Finally, we propose recommendations for improving image-text alignment evaluation.