MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image models show weak multimodal reasoning when generating knowledge-intensive images (e.g., charts, mind maps), limiting semantic and structural fidelity. Method: The authors introduce knowledge image generation as a new task and build MMMG, the first expert-validated, cross-disciplinary benchmark of 4,456 image-prompt pairs spanning 10 disciplines and 6 educational levels, with each image's semantics uniformly represented as a knowledge graph (KG). They propose the KG-driven MMMG-Score, an evaluation metric that combines graph edit distance between KGs (factual fidelity) with a visual-clarity assessment. Contribution/Results: They release FLUX-Reason, an open baseline that pairs a reasoning LLM with diffusion models and achieves an MMMG-Score of 34.45. Evaluation of 16 state-of-the-art models reveals pervasive reasoning deficits (GPT-4o scores only 50.20), underscoring the need for interpretable, knowledge-grounded image generation.

📝 Abstract
In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models. Knowledge images have been central to human civilization and to the mechanisms of human learning--a fact underscored by dual-coding theory and the picture-superiority effect. Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals. To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image's core entities and their dependencies. We further introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits--low entity fidelity, weak relations, and clutter--with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark's difficulty. To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs.
Problem

Research questions and friction points this paper is trying to address.

Proposing knowledge image generation as a new task
Evaluating reasoning capability of image generation models
Addressing challenges in multimodal knowledge-image fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge Graph representation for evaluation
MMMG-Score metric combining fidelity and clarity
FLUX-Reason baseline combining LLM and diffusion
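The fidelity half of the MMMG-Score is described as a graph-edit distance between the knowledge graph extracted from a generated image and the reference KG. A minimal sketch of such a fidelity term, using `networkx`'s `graph_edit_distance` with label-aware node matching and a normalization of my own choosing (the paper's exact formulation is not reproduced here):

```python
import networkx as nx

def to_kg(edges):
    """Build a directed KG from (head, tail) entity pairs, storing the
    entity name as a node attribute so edit distance can compare labels."""
    g = nx.DiGraph()
    for u, v in edges:
        g.add_node(u, label=u)
        g.add_node(v, label=v)
        g.add_edge(u, v)
    return g

def kg_fidelity(pred_edges, ref_edges):
    """Hypothetical fidelity term: 1 - normalized graph edit distance
    between the predicted and reference knowledge graphs."""
    g_pred, g_ref = to_kg(pred_edges), to_kg(ref_edges)
    # Crude upper bound on edit cost: delete one graph entirely,
    # then insert the other entirely.
    max_cost = (g_pred.number_of_nodes() + g_pred.number_of_edges()
                + g_ref.number_of_nodes() + g_ref.number_of_edges())
    if max_cost == 0:
        return 1.0
    ged = nx.graph_edit_distance(
        g_pred, g_ref,
        node_match=lambda a, b: a["label"] == b["label"])
    return 1.0 - ged / max_cost
```

A perfect match scores 1.0; missing entities or wrong relations push the score toward 0. The full MMMG-Score additionally weighs in visual clarity, which this sketch omits.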
Yuxuan Luo
City University of Hong Kong
Few-shot learning · Zero-shot learning · Continual learning
Yuhui Yuan
Canva CORE, ex-Microsoft Research Asia
Generative AI + Design · Computer Vision
Junwen Chen
The University of Electro-Communications
Haonan Cai
Wangxuan Institute of Computer Technology, Peking University, China
Ziyi Yue
Wangxuan Institute of Computer Technology, Peking University, China
Yuwei Yang
Australian National University
Fatima Zohra Daha
Microsoft
Ji Li
Principal Group Science Manager at Microsoft
AI · CAD
Zhouhui Lian
Peking University
Computer Graphics · Computer Vision · AI