🤖 AI Summary
Existing T2I and image editing benchmarks suffer from a critical disconnect: T2I benchmarks lack multimodal conditioning, while editing benchmarks neglect compositional semantics and commonsense reasoning, leading to incomplete evaluation of multimodal generative models. To address this, we introduce MMIG-Bench, the first comprehensive multimodal image generation benchmark. It covers three core tasks (text-to-image generation, image editing, and concept consistency) with 4,850 multi-granularity prompts and 1,750 multi-perspective reference image sets. We propose a three-level interpretable evaluation framework: low-level visual fidelity; a mid-level Aspect Matching Score (AMS) grounded in VQA, which correlates strongly with human judgment (ρ > 0.87); and high-level aesthetic and preference assessment. The framework integrates VQA models, multi-scale quality metrics, 32k crowdsourced human ratings, and semantic alignment analysis. Extensive evaluation of 17 state-of-the-art models reveals how architectural choices and training data design affect performance. All data and code are publicly released.
📝 Abstract
Recent multimodal image generators such as GPT-4o, Gemini 2.0 Flash, and Gemini 2.5 Pro excel at following complex instructions, editing images, and maintaining concept consistency. However, they are still evaluated by disjoint toolkits: text-to-image (T2I) benchmarks that lack multi-modal conditioning, and customized image generation benchmarks that overlook compositional semantics and common knowledge. We propose MMIG-Bench, a comprehensive Multi-Modal Image Generation Benchmark that unifies these tasks by pairing 4,850 richly annotated text prompts with 1,750 multi-view reference images across 380 subjects, spanning humans, animals, objects, and artistic styles. MMIG-Bench is equipped with a three-level evaluation framework: (1) low-level metrics for visual artifacts and identity preservation of objects; (2) a novel Aspect Matching Score (AMS), a VQA-based mid-level metric that delivers fine-grained prompt-image alignment and shows strong correlation with human judgments; and (3) high-level metrics for aesthetics and human preference. Using MMIG-Bench, we benchmark 17 state-of-the-art models, including Gemini 2.5 Pro, FLUX, DreamBooth, and IP-Adapter, and validate our metrics with 32k human ratings, yielding in-depth insights into architecture and data design. We will release the dataset and evaluation code to foster rigorous, unified evaluation and accelerate future innovations in multi-modal image generation.
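To make the mid-level metric concrete, here is a minimal sketch of how a VQA-grounded aspect-matching score can be computed: the prompt is decomposed into atomic aspects (objects, attributes, relations), each aspect becomes a yes/no question, and the score is the fraction of questions a VQA model answers affirmatively. The choice of BLIP-VQA, the `aspect_matching_score` function, and the hand-written questions are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of a VQA-based aspect-matching score (AMS-style).
# Assumptions: aspect extraction happens upstream (here, hand-written
# yes/no questions), and BLIP-VQA stands in for whatever VQA model the
# paper uses. This is not the authors' implementation.
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

def aspect_matching_score(image: Image.Image, questions: list[str]) -> float:
    """Fraction of prompt aspects the VQA model confirms in the image."""
    hits = 0
    for q in questions:
        inputs = processor(image, q, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=5)
        answer = processor.decode(out[0], skip_special_tokens=True).strip().lower()
        hits += answer.startswith("yes")  # bool counts as 0/1
    return hits / len(questions)

# Example aspects for the hypothetical prompt
# "a red cube on top of a blue sphere":
questions = [
    "Is there a red cube in the image?",
    "Is there a blue sphere in the image?",
    "Is the cube on top of the sphere?",
]
image = Image.open("generated.png").convert("RGB")
print(f"AMS = {aspect_matching_score(image, questions):.2f}")
```

A per-aspect breakdown like this is what makes the metric interpretable: a low score can be traced to the specific object, attribute, or relation the generator missed, rather than a single opaque alignment number.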