🤖 AI Summary
This work addresses the lack of systematic evaluation of semantic reasoning in text-to-image (T2I) models. We introduce the first multi-task benchmark covering four dimensions: Idiom Interpretation, Textual Image Design, Entity-Reasoning, and Scientific-Reasoning. Our two-stage evaluation framework first employs prompt engineering to generate reasoning-oriented tasks, then jointly applies human judgment and automated metrics to assess both reasoning accuracy and image fidelity. Crucially, this framework enables the first quantitative analysis of deep cross-modal semantic alignment. Empirical evaluation of mainstream T2I models reveals significant bottlenecks in representing abstract concepts and performing logical reasoning. Our benchmark provides a reproducible foundation and theoretical grounding for modeling, evaluating, and improving reasoning capabilities in T2I systems.
📝 Abstract
We propose T2I-ReasonBench, a benchmark evaluating the reasoning capabilities of text-to-image (T2I) models. It consists of four dimensions: Idiom Interpretation, Textual Image Design, Entity-Reasoning, and Scientific-Reasoning. We propose a two-stage evaluation protocol to assess reasoning accuracy and image quality. We benchmark a range of T2I generation models and provide a comprehensive analysis of their performance.
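The abstract only names the two-stage protocol, so the sketch below is a minimal illustration of how such a pipeline could be wired up, not the released implementation: stage one expands a reasoning-oriented prompt into checkable criteria (e.g., via an LLM), and stage two scores a generated image for reasoning accuracy and image quality via automated judges and/or human ratings. All names here (`derive_criteria`, `score_image`, `score_quality`, `EvalResult`) are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    reasoning_accuracy: float  # does the image reflect the prompt's implied meaning?
    image_quality: float       # visual fidelity, independent of reasoning


def evaluate(
    prompt: str,
    image_path: str,
    derive_criteria: Callable[[str], list[str]],  # stage 1: prompt -> checkable criteria
    score_image: Callable[[str, str], float],     # stage 2: (image, criterion) -> [0, 1]
    score_quality: Callable[[str], float],        # stage 2: image -> quality in [0, 1]
) -> EvalResult:
    """Two-stage protocol: (1) expand the prompt into criteria, (2) score the image."""
    criteria = derive_criteria(prompt)
    if not criteria:
        raise ValueError("stage 1 produced no criteria for this prompt")
    accuracy = sum(score_image(image_path, c) for c in criteria) / len(criteria)
    quality = score_quality(image_path)
    return EvalResult(reasoning_accuracy=accuracy, image_quality=quality)


# Example with trivial stand-ins; real judges would call an LLM / vision-language
# model or aggregate human ratings.
result = evaluate(
    prompt="a painting of 'the elephant in the room'",
    image_path="out.png",
    derive_criteria=lambda p: [
        "a literal elephant is present in an indoor scene",
        "the scene conveys an ignored, awkward presence",
    ],
    score_image=lambda img, c: 0.5,  # placeholder criterion judge
    score_quality=lambda img: 0.5,   # placeholder quality metric
)
print(result)
```

Separating the criterion generator from the scorers keeps the two stages independently swappable, so the same harness can compare automated judges against human annotation on identical criteria.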