🤖 AI Summary
Current text-to-image (T2I) models exhibit limited capabilities on complex reasoning tasks, such as commonsense, mathematical, and logical reasoning, and no systematic, dedicated benchmark exists to evaluate them. Method: We introduce R2I-Bench, the first benchmark explicitly designed to assess reasoning abilities in T2I generation, covering seven distinct reasoning categories. We propose R2IScore, a fine-grained, question-based evaluation metric that quantifies performance along three dimensions: text-image alignment, reasoning accuracy, and image quality. We additionally construct a human-annotated ground-truth dataset and a pipeline-based framework that decouples reasoning from generation. Contribution/Results: Extensive experiments on 16 state-of-the-art T2I models reveal pervasive deficiencies across all reasoning categories; even the best-performing model falls well short of ideal performance. R2I-Bench is publicly released as an open-source benchmark, establishing a rigorous standard for reasoning-focused T2I evaluation.
📝 Abstract
Reasoning is a fundamental capability often required in real-world text-to-image (T2I) generation; for example, generating "a bitten apple that has been left in the air for more than a week" requires understanding temporal decay and commonsense concepts. While recent T2I models have made impressive progress in producing photorealistic images, their reasoning capability remains underdeveloped and insufficiently evaluated. To bridge this gap, we introduce R2I-Bench, a comprehensive benchmark specifically designed to rigorously assess reasoning-driven T2I generation. R2I-Bench comprises meticulously curated data instances spanning seven core reasoning categories: commonsense, mathematical, logical, compositional, numerical, causal, and concept mixing. To facilitate fine-grained evaluation, we design R2IScore, a QA-style metric based on instance-specific, reasoning-oriented evaluation questions that assess three critical dimensions: text-image alignment, reasoning accuracy, and image quality. Extensive experiments with 16 representative T2I models, including a strong pipeline-based framework that decouples reasoning and generation using state-of-the-art language and image generation models, demonstrate consistently limited reasoning performance, highlighting the need for more robust, reasoning-aware architectures in the next generation of T2I systems. Project Page: https://r2i-bench.github.io
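To make the QA-style evaluation idea concrete, here is a minimal sketch of how per-question judge scores could be aggregated into per-dimension and overall scores. This is an illustrative assumption, not the paper's actual R2IScore implementation; the class and function names (`EvalQuestion`, `aggregate_scores`) are hypothetical.

```python
# Hypothetical sketch: aggregate QA-style judge answers into per-dimension scores.
# Names and scoring scheme are illustrative, not the released R2IScore code.
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvalQuestion:
    dimension: str   # one of "alignment", "reasoning", "quality"
    score: float     # judge's answer mapped to [0, 1] (e.g., yes=1, no=0)


def aggregate_scores(questions):
    """Average scores within each dimension, then average dimensions overall."""
    by_dim = {}
    for q in questions:
        by_dim.setdefault(q.dimension, []).append(q.score)
    per_dim = {d: mean(scores) for d, scores in by_dim.items()}
    overall = mean(per_dim.values())
    return per_dim, overall


# Example: one generated image evaluated with four instance-specific questions.
qs = [
    EvalQuestion("alignment", 1.0),
    EvalQuestion("alignment", 0.0),
    EvalQuestion("reasoning", 1.0),
    EvalQuestion("quality", 0.5),
]
per_dim, overall = aggregate_scores(qs)
# per_dim == {"alignment": 0.5, "reasoning": 1.0, "quality": 0.5}; overall == 2/3
```

Averaging per dimension before averaging overall keeps a dimension with many questions from dominating the final score; whether the real metric weights dimensions this way is an assumption here.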