T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-Image Generation

📅 2023-07-12
🏛️ IEEE Transactions on Pattern Analysis and Machine Intelligence
📈 Citations: 155
Influential: 43
🤖 AI Summary
Current text-to-image (T2I) models exhibit significant limitations in compositional generation, particularly for complex scenes involving multiple objects, attribute binding, spatial reasoning, and numerical relationships. To address this, we introduce T2I-CompBench++, an enhanced fine-grained compositional T2I benchmark comprising 8,000 prompts spanning eight sub-categories across four categories: attribute binding, object relationships, generative numeracy, and complex compositions, including newly introduced challenges such as 3D-spatial relationships and numeracy. We propose detection-based automatic evaluation metrics and a multimodal large language model (MLLM)-based assessment framework leveraging GPT-4V and ShareGPT4V. A comprehensive evaluation of 11 state-of-the-art T2I models, including FLUX.1, SD3, and DALL·E-3, demonstrates the consistency and robustness of our metrics while revealing fundamental deficiencies in existing models' spatial reasoning and numeracy capabilities.
📝 Abstract
Despite the impressive advances in text-to-image models, they often struggle to effectively compose complex scenes with multiple objects, displaying various attributes and relationships. To address this challenge, we present T2I-CompBench++, an enhanced benchmark for compositional text-to-image generation. T2I-CompBench++ comprises 8,000 compositional text prompts categorized into four primary groups: attribute binding, object relationships, generative numeracy, and complex compositions. These are further divided into eight sub-categories, including newly introduced ones like 3D-spatial relationships and numeracy. In addition to the benchmark, we propose enhanced evaluation metrics designed to assess these diverse compositional challenges. These include a detection-based metric tailored for evaluating 3D-spatial relationships and numeracy, and an analysis leveraging Multimodal Large Language Models (MLLMs), i.e., GPT-4V and ShareGPT4V, as evaluation metrics. Our experiments benchmark 11 text-to-image models, including state-of-the-art models such as FLUX.1, SD3, DALLE-3, PixArt-α, and SD-XL, on T2I-CompBench++. We also conduct comprehensive evaluations to validate the effectiveness of our metrics and explore the potential and limitations of MLLMs. Project page is available at https://karine-h.github.io/T2I-CompBench-new/.
Problem

Research questions and friction points this paper is trying to address.

Evaluates how well text-to-image models compose complex scenes with multiple objects, attributes, and relationships.
Introduces detection-based metrics for 3D-spatial relationships and numeracy (see the sketch after this list).
Benchmarks 11 models using the enhanced evaluation techniques.
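A minimal sketch of the detection-based idea: run an object detector over the generated image and score the prompt's spatial relation or object count against the detected boxes. The `Detection` container, confidence threshold, and exact scoring rules here are illustrative assumptions, not the paper's released pipeline (which builds on an off-the-shelf detector such as UniDet).

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str               # detected class name
    score: float             # detector confidence
    box: tuple               # (x1, y1, x2, y2) in pixel coordinates

def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def spatial_score(dets, obj_a, obj_b, relation, min_conf=0.3):
    """Score a prompt like '<obj_a> on the left of <obj_b>'.

    Returns 1.0 if both objects are detected and their box centers
    satisfy the relation, else 0.0 (partial-credit schemes vary).
    """
    a = next((d for d in dets if d.label == obj_a and d.score >= min_conf), None)
    b = next((d for d in dets if d.label == obj_b and d.score >= min_conf), None)
    if a is None or b is None:
        return 0.0
    (ax, ay), (bx, by) = center(a.box), center(b.box)
    checks = {
        "left of": ax < bx,
        "right of": ax > bx,
        "above": ay < by,    # image y-axis grows downward
        "below": ay > by,
    }
    return float(checks.get(relation, False))

def numeracy_score(dets, obj, target_count, min_conf=0.3):
    """Score a numeracy prompt like 'four <obj>s' by exact count match."""
    n = sum(1 for d in dets if d.label == obj and d.score >= min_conf)
    return float(n == target_count)
```

3D-spatial relations ("in front of", "behind") additionally need a per-object depth estimate, e.g. the mean of a monocular depth map inside each box, compared the same way.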
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhanced benchmark of 8,000 prompts for compositional text-to-image generation
Detection-based metrics for 3D-spatial relationships and numeracy
Multimodal Large Language Models (GPT-4V, ShareGPT4V) as evaluation metrics (sketched below)
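A hedged sketch of the MLLM-as-judge direction: send the generated image plus a grading instruction to a multimodal model and parse a score. It uses the OpenAI Python SDK; the model name, grading prompt, and 0-100 scale are assumptions for illustration, not the paper's released evaluation code.

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def mllm_alignment_score(image_path: str, prompt: str) -> str:
    """Ask a multimodal LLM to grade image-prompt compositional alignment."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # stand-in for the GPT-4V-class judge used in the paper
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "On a scale of 0-100, how well does this image match the "
                    f"prompt '{prompt}' in objects, attributes, and spatial "
                    "layout? Answer with the number only."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip()
```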