Thinking with Generated Images

πŸ“… 2025-05-28
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing large multimodal models (LMMs) have limited visual reasoning capabilities: they either process fixed image inputs or reason purely through textual chain-of-thought, and cannot autonomously generate, critically evaluate, and iteratively refine intermediate visual representations. This work introduces Thinking with Generated Images, a paradigm that enables LMMs to actively perform visual imagination and self-correction during inference. The method employs a joint text-image generation architecture that supports visual subgoal decomposition, multi-step diffusion-based generation, text-guided visual diagnosis, and feedback-driven representation refinement. On vision generation benchmarks it achieves a 19-percentage-point accuracy improvement (38% β†’ 57%), substantially enhancing comprehension and synthesis in complex, multi-object scenes. The implementation, including code and a comprehensive toolkit, is publicly released.

πŸ“ Abstract
We present Thinking with Generated Images, a novel paradigm that fundamentally transforms how large multimodal models (LMMs) engage with visual reasoning by enabling them to natively think across text and vision modalities through spontaneous generation of intermediate visual thinking steps. Current visual reasoning with LMMs is constrained to either processing fixed user-provided images or reasoning solely through text-based chain-of-thought (CoT). Thinking with Generated Images unlocks a new dimension of cognitive capability where models can actively construct intermediate visual thoughts, critique their own visual hypotheses, and refine them as integral components of their reasoning process. We demonstrate the effectiveness of our approach through two complementary mechanisms: (1) vision generation with intermediate visual subgoals, where models decompose complex visual tasks into manageable components that are generated and integrated progressively, and (2) vision generation with self-critique, where models generate an initial visual hypothesis, analyze its shortcomings through textual reasoning, and produce refined outputs based on their own critiques. Our experiments on vision generation benchmarks show substantial improvements over baseline approaches, with our models achieving up to 50% (from 38% to 57%) relative improvement in handling complex multi-object scenarios. From biochemists exploring novel protein structures, and architects iterating on spatial designs, to forensic analysts reconstructing crime scenes, and basketball players envisioning strategic plays, our approach enables AI models to engage in the kind of visual imagination and iterative refinement that characterizes human creative, analytical, and strategic thinking. We release our open-source suite at https://github.com/GAIR-NLP/thinking-with-generated-images.
Problem

Research questions and friction points this paper is trying to address.

Enabling LMMs to generate intermediate visual thoughts for reasoning
Improving visual reasoning by integrating self-critique and refinement
Enhancing complex multi-object scenario handling via visual subgoals
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates intermediate visual thinking steps
Decomposes tasks into visual subgoals
Refines outputs via self-critique mechanism
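The self-critique mechanism above can be pictured as a simple generate-critique-refine loop. The sketch below is a minimal illustration of that control flow, not the authors' implementation: `generate_image`, `critique_image`, and `refine_image` are hypothetical stubs standing in for the model's diffusion-based generation, textual self-diagnosis, and critique-conditioned regeneration.

```python
# Hypothetical sketch of "vision generation with self-critique".
# All three model calls below are stubs, not the paper's actual API.

def generate_image(prompt):
    # Stub for the LMM's initial visual hypothesis.
    return {"prompt": prompt, "revision": 0}

def critique_image(image, prompt):
    # Stub for textual self-critique of the visual hypothesis.
    # Here we simply report an issue for the first two rounds.
    if image["revision"] < 2:
        return "missing objects"  # hypothetical critique text
    return None  # no remaining issues

def refine_image(image, critique):
    # Stub for regeneration conditioned on the critique.
    return {"prompt": image["prompt"], "revision": image["revision"] + 1}

def think_with_generated_images(prompt, max_rounds=5):
    """Generate an initial visual hypothesis, then iteratively
    critique and refine it until the critique finds no issues."""
    image = generate_image(prompt)
    for _ in range(max_rounds):
        critique = critique_image(image, prompt)
        if critique is None:
            break
        image = refine_image(image, critique)
    return image

result = think_with_generated_images("a cat riding a bicycle")
print(result["revision"])  # β†’ 2 refinement rounds with these stubs
```

The first mechanism, subgoal decomposition, would replace the single `generate_image` call with a loop that generates and progressively composes one sub-image per decomposed subgoal.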
Ethan Chern
Shanghai Jiao Tong University
Machine Learning, Natural Language Processing, Artificial Intelligence
Zhulin Hu
Shanghai Jiao Tong University, Generative AI Research Lab (GAIR)
Steffi Chern
University of Pennsylvania
Natural Language Processing, Artificial Intelligence
Siqi Kou
Shanghai Jiao Tong University
Machine Learning
Jiadi Su
Fudan University, Generative AI Research Lab (GAIR)
Yan Ma
Fudan University, Generative AI Research Lab (GAIR)
Zhijie Deng
Shanghai Jiao Tong University
Pengfei Liu
Shanghai Jiao Tong University, SII, Generative AI Research Lab (GAIR)