GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

๐Ÿ“… 2025-05-22
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
Visual generation models struggle with complex text prompts involving multiple objects, precise spatial relationships, and attribute binding. To address this, the authors propose a reinforcement learning framework with a dual-stage, multi-dimensional reward that employs a multimodal large language model (MLLM) as a discriminator to jointly score semantic alignment, spatial accuracy, and visual quality. Departing from rigid chain-of-thought templates, the approach builds on Generation Chain-of-Thought (GoT) and uses reinforcement learning to let the model autonomously discover and refine reasoning strategies, improving fine-grained object localization in compositional generation tasks. Evaluated on the T2I-CompBench benchmark, the method substantially outperforms prior approaches in spatial relation modeling and attribute binding, yielding significant gains in compositional text-to-image generation.

๐Ÿ“ Abstract
Visual generation models have made remarkable progress in creating realistic images from text prompts, yet struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 enables models to autonomously discover effective reasoning strategies beyond predefined templates through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. Experimental results demonstrate significant improvements on T2I-CompBench benchmark, particularly in compositional tasks involving precise spatial relationships and attribute binding. GoT-R1 advances the state-of-the-art in image generation by successfully transferring sophisticated reasoning capabilities to the visual generation domain. To facilitate future research, we make our code and pretrained models publicly available at https://github.com/gogoduan/GoT-R1.
Problem

Research questions and friction points this paper is trying to address.

Enhancing semantic-spatial reasoning in visual generation models
Improving handling of complex prompts with multiple objects
Advancing image generation with precise spatial relationships
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses reinforcement learning for semantic-spatial reasoning
Implements dual-stage multi-dimensional reward framework
Leverages MLLMs to evaluate reasoning and output
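The dual-stage reward described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the function names, weights, score scale, and the stand-in `mllm_judge` heuristic are all assumptions; a real system would query an MLLM discriminator with a rubric for each dimension and feed the combined scalar to a PPO-style update.

```python
# Hypothetical sketch of a dual-stage, multi-dimensional reward:
# stage 1 scores the reasoning chain, stage 2 scores the final output
# along semantic, spatial, and visual dimensions; the two stages are
# averaged into one scalar usable as an RL reward. All names and
# weights here are illustrative assumptions, not the paper's code.
from dataclasses import dataclass


@dataclass
class RewardWeights:
    semantic: float = 1.0
    spatial: float = 1.0
    visual: float = 1.0


def mllm_judge(prompt: str, artifact: str, dimension: str) -> float:
    """Stand-in for an MLLM discriminator returning a score in [0, 1].

    A real system would prompt a multimodal LLM with the reasoning
    chain or generated image plus a rubric for `dimension`; here we use
    a toy word-overlap heuristic so the sketch is runnable.
    """
    words = prompt.lower().split()
    hits = sum(w in artifact.lower() for w in words)
    return min(1.0, hits / max(1, len(words)))


def dual_stage_reward(prompt: str, reasoning: str, image_caption: str,
                      w: RewardWeights = RewardWeights()) -> float:
    """Combine stage-1 (reasoning) and stage-2 (output) scores equally."""
    stage1 = mllm_judge(prompt, reasoning, "semantic")
    stage2 = (w.semantic * mllm_judge(prompt, image_caption, "semantic")
              + w.spatial * mllm_judge(prompt, image_caption, "spatial")
              + w.visual * mllm_judge(prompt, image_caption, "visual"))
    total = w.semantic + w.spatial + w.visual
    return 0.5 * stage1 + 0.5 * stage2 / total


reward = dual_stage_reward(
    prompt="a red cube left of a blue sphere",
    reasoning="place the red cube on the left, then the blue sphere",
    image_caption="a red cube to the left of a blue sphere",
)
print(round(reward, 3))
```

Because both the reasoning process and the final image contribute to the reward, the policy is supervised across the whole generation pipeline rather than only at the output, which is the key idea the Innovation points above describe.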