GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

📅 2025-03-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing image generation and editing methods rely on end-to-end text-to-image mapping, lacking semantic-spatial reasoning for visual composition and explicit manipulation. To address this, we propose Generation Chain-of-Thought (GoT), which formalizes image generation and editing as a stepwise language-based reasoning process: first parsing object semantics and spatial layouts, then guiding image synthesis—enabling user intervention within the reasoning chain for precise, controllable editing. Our key contributions are: (1) the first chain-of-thought–guided visual generation paradigm; (2) a large-scale, 9-million-sample GoT multimodal instruction dataset; and (3) a semantic-spatial guidance module that tightly couples Qwen2.5-VL with an end-to-end diffusion model. Experiments demonstrate that GoT significantly outperforms baselines in generation fidelity, intent alignment, and editing controllability. Code, data, and models are publicly released.
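The stepwise process described above — first producing a language-based reasoning chain of objects and spatial layouts, then allowing the user to edit individual steps before synthesis — can be illustrated with a minimal conceptual sketch. All names here (`ReasoningStep`, `parse_prompt_to_chain`, `edit_chain`) are illustrative assumptions, not the authors' actual API; the real system uses Qwen2.5-VL to produce the chain and a diffusion model to render it.

```python
# Conceptual sketch of the GoT two-stage idea (illustrative only, not the paper's code).
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ReasoningStep:
    object_name: str
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) spatial layout
    description: str                 # object semantics in natural language

def parse_prompt_to_chain(prompt: str) -> List[ReasoningStep]:
    # Stage 1 stand-in: in GoT, an MLLM would emit this chain from the prompt;
    # here we hard-code one example chain for demonstration.
    return [
        ReasoningStep("cat", (10, 40, 120, 200), "a gray cat sitting"),
        ReasoningStep("sofa", (0, 150, 256, 256), "a red sofa"),
    ]

def edit_chain(chain: List[ReasoningStep], object_name: str,
               new_bbox: Tuple[int, int, int, int]) -> List[ReasoningStep]:
    # User intervention: precise editing by rewriting one reasoning step
    # (e.g. moving an object) instead of re-prompting the whole image.
    return [
        ReasoningStep(s.object_name, new_bbox, s.description)
        if s.object_name == object_name else s
        for s in chain
    ]

chain = parse_prompt_to_chain("a gray cat sitting on a red sofa")
edited = edit_chain(chain, "cat", (50, 40, 160, 200))
```

Stage 2 (not shown) would condition a diffusion model on each step's semantics and layout, which is the role the paper's Semantic-Spatial Guidance Module plays.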

📝 Abstract
Current image generation and editing methods primarily process textual prompts as direct inputs without reasoning about visual composition and explicit operations. We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. We define the formulation of GoT and construct large-scale GoT datasets containing over 9M samples with detailed reasoning chains capturing semantic-spatial relationships. To leverage the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module. Experiments show our GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines. Additionally, our approach enables interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments. GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent. To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/rongyaofang/GoT.
Problem

Research questions and friction points this paper is trying to address.

Text-to-image models map prompts directly to pixels without reasoning about visual composition
Editing methods lack semantic-spatial analysis, preventing explicit, localized operations
Users cannot intervene mid-generation to make precise, controllable adjustments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Generation Chain-of-Thought (GoT) paradigm
Integrates Qwen2.5-VL with an end-to-end diffusion model via a Semantic-Spatial Guidance Module
Enables interactive visual generation and editing