Simple o3: Towards Interleaved Vision-Language Reasoning

📅 2025-08-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit limited capability in long-horizon visual reasoning, particularly in emulating the dynamic perception-and-reasoning process underlying human “thinking while looking.” To address this, we propose Simple o3, an end-to-end interleaved reasoning framework that integrates executable visual operations—such as cropping, scaling, and reusing image regions—into an “observe-reason-act” loop. It introduces auxiliary visual tokens to enhance fine-grained perceptual grounding and establishes a high-quality synthetic data pipeline, releasing TWI-Tools-146K—the first dataset explicitly designed for tool-augmented visual reasoning. We conduct the first systematic ablation study on interleaving strategies, demonstrating that image reuse, precise localization-based cropping, and adaptive magnification significantly improve multi-step visual reasoning performance. Our method achieves state-of-the-art results across multiple benchmarks.

📝 Abstract
Multimodal Large Language Models (MLLMs) have shown impressive performance on vision-language tasks, but their long Chain-of-Thought (CoT) capabilities in multimodal scenarios remain underexplored. Inspired by OpenAI's o3 model, which emulates human-like "thinking with image" through iterative visual transformations and linguistic reasoning, we propose Simple o3, an end-to-end framework that integrates dynamic tool interactions (e.g., cropping, zooming, and reusing) into interleaved vision-language reasoning via supervised fine-tuning (SFT). Our approach features a scalable data synthesis pipeline that generates high-quality interleaved vision-language reasoning chains via an "observe-reason-act" cycle, complete with executable visual operations and rigorous verification, yielding the open-source TWI-Tools-146K dataset. Experimental results demonstrate Simple o3's superior performance on diverse benchmarks, outperforming existing approaches. With its enhanced reasoning capabilities, Simple o3 establishes a powerful yet computationally affordable paradigm for advancing multimodal reasoning. Remarkably, we provide the first in-depth analysis of different interleaved reasoning strategies, offering insights into their impact on model performance. We found that, by introducing additional visual tokens for interleaved vision-language reasoning, reusing and magnifying the original image significantly improves the model's visual reasoning and fine-grained perception, while image cropping based on precise visual grounding allows the model to focus effectively on key entities or regions, further enhancing its capabilities.
Problem

Research questions and friction points this paper is trying to address.

Enhancing long multimodal Chain-of-Thought reasoning in MLLMs
Integrating dynamic visual tools into vision-language reasoning
Analyzing interleaved reasoning strategies for improved model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic tool interactions for vision-language reasoning
Scalable data synthesis with observe-reason-act cycle
Enhanced visual tokens improve reasoning and perception
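To make the "observe-reason-act" cycle concrete, here is a minimal, purely illustrative sketch of such an interleaved loop. All names (`Image`, `VisualAction`, `apply_action`, `run_loop`) are hypothetical stand-ins, not the paper's actual API; the model's reasoning step is stubbed out, and the three operations (crop, zoom, reuse) mirror the tool set described in the abstract.

```python
from dataclasses import dataclass

@dataclass
class Image:
    """Toy stand-in for an image: only its dimensions matter here."""
    width: int
    height: int

@dataclass
class VisualAction:
    kind: str               # "crop" | "zoom" | "reuse" (hypothetical encoding)
    box: tuple = None       # (left, top, right, bottom) for crop
    factor: float = 1.0     # magnification factor for zoom

def apply_action(img, action):
    """Execute one visual operation and return the resulting (toy) image."""
    if action.kind == "crop":
        l, t, r, b = action.box
        return Image(width=r - l, height=b - t)
    if action.kind == "zoom":
        return Image(width=int(img.width * action.factor),
                     height=int(img.height * action.factor))
    if action.kind == "reuse":
        return img          # re-attach the original image's visual tokens
    raise ValueError(f"unknown action: {action.kind}")

def run_loop(img, actions):
    """Observe-reason-act: interleave visual ops with (stubbed) reasoning.

    In the real system, the MLLM would inspect each new view, reason in
    text, and decide the next action; here the action list is given.
    """
    trace = [img]
    for act in actions:
        img = apply_action(img, act)
        trace.append(img)   # each new view contributes extra visual tokens
    return trace
```

For example, cropping a 640x480 image to a 200x160 region of interest and then magnifying it 2x yields a 400x320 view appended to the reasoning trace, mimicking how focused, enlarged regions feed fine-grained perception back into the chain of thought.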
👥 Authors
Ye Wang (Fudan University)
Qianglong Chen (Independent Researcher)
Zejun Li (Fudan University)
Siyuan Wang (University of Southern California)
Shijie Guo (Fudan University)
Zhirui Zhang (Independent Researcher)
Zhongyu Wei (Fudan University)

Tags: vision-language, multi-modality