GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

๐Ÿ“… 2025-05-22
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
Visual generation models struggle with complex text prompts involving multiple objects, precise spatial relationships, and attribute binding. To address this, the authors propose a reinforcement learning framework with a dual-stage, multi-dimensional reward that employs a multimodal large language model (MLLM) as a discriminator to jointly score semantic alignment, spatial accuracy, and visual quality. Departing from rigid chain-of-thought templates, the approach builds on Generation Chain-of-Thought (GoT) and uses reinforcement learning to let the model autonomously discover and refine reasoning strategies, improving fine-grained object localization in compositional generation tasks. Evaluated on the T2I-CompBench benchmark, the method substantially outperforms prior approaches in spatial relation modeling and attribute binding, yielding significant gains in compositional text-to-image generation.

๐Ÿ“ Abstract
Visual generation models have made remarkable progress in creating realistic images from text prompts, yet struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 enables models to autonomously discover effective reasoning strategies beyond predefined templates through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. Experimental results demonstrate significant improvements on T2I-CompBench benchmark, particularly in compositional tasks involving precise spatial relationships and attribute binding. GoT-R1 advances the state-of-the-art in image generation by successfully transferring sophisticated reasoning capabilities to the visual generation domain. To facilitate future research, we make our code and pretrained models publicly available at https://github.com/gogoduan/GoT-R1.
Problem

Research questions and friction points this paper is trying to address.

Enhancing semantic-spatial reasoning in visual generation models
Improving handling of complex prompts with multiple objects
Advancing image generation with precise spatial relationships
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses reinforcement learning for semantic-spatial reasoning
Implements dual-stage multi-dimensional reward framework
Leverages MLLMs to evaluate reasoning and output
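The dual-stage reward described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the function names, weights, score scale, and the stand-in `mllm_judge` heuristic are all assumptions; a real system would query an MLLM discriminator with a rubric for each dimension and feed the combined scalar to a PPO-style update.

```python
# Hypothetical sketch of a dual-stage, multi-dimensional reward:
# stage 1 scores the reasoning chain, stage 2 scores the final output
# along semantic, spatial, and visual dimensions; the two stages are
# averaged into one scalar usable as an RL reward. All names and
# weights here are illustrative assumptions, not the paper's code.
from dataclasses import dataclass


@dataclass
class RewardWeights:
    semantic: float = 1.0
    spatial: float = 1.0
    visual: float = 1.0


def mllm_judge(prompt: str, artifact: str, dimension: str) -> float:
    """Stand-in for an MLLM discriminator returning a score in [0, 1].

    A real system would prompt a multimodal LLM with the reasoning
    chain or generated image plus a rubric for `dimension`; here we use
    a toy word-overlap heuristic so the sketch is runnable.
    """
    words = prompt.lower().split()
    hits = sum(w in artifact.lower() for w in words)
    return min(1.0, hits / max(1, len(words)))


def dual_stage_reward(prompt: str, reasoning: str, image_caption: str,
                      w: RewardWeights = RewardWeights()) -> float:
    """Combine stage-1 (reasoning) and stage-2 (output) scores equally."""
    stage1 = mllm_judge(prompt, reasoning, "semantic")
    stage2 = (w.semantic * mllm_judge(prompt, image_caption, "semantic")
              + w.spatial * mllm_judge(prompt, image_caption, "spatial")
              + w.visual * mllm_judge(prompt, image_caption, "visual"))
    total = w.semantic + w.spatial + w.visual
    return 0.5 * stage1 + 0.5 * stage2 / total


reward = dual_stage_reward(
    prompt="a red cube left of a blue sphere",
    reasoning="place the red cube on the left, then the blue sphere",
    image_caption="a red cube to the left of a blue sphere",
)
print(round(reward, 3))
```

Because both the reasoning process and the final image contribute to the reward, the policy is supervised across the whole generation pipeline rather than only at the output, which is the key idea the Innovation points above describe.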