Thinking with Images via Self-Calling Agent

📅 2025-12-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of high-quality multimodal reasoning data for iterative Multimodal Chain-of-Thought (iMCoT) visual reasoning, this paper proposes Self-Calling Chain-of-Thought (sCoT). sCoT recasts multimodal reasoning as a pure-language chain via a self-calling mechanism: a main agent decomposes the task and orchestrates parameter-shared sub-agents, each solving a subtask independently in an isolated context. The work shifts the paradigm from interleaved multimodal reasoning to a pure-language self-calling architecture and employs Group Relative Policy Optimization (GRPO) to reinforce effective reasoning behavior. Experiments on HR-Bench 4K show that sCoT improves reasoning accuracy by up to 1.9% while cutting GPU training time by roughly 75%, outperforming strong baselines. The core contribution is a lightweight, efficient, and scalable paradigm for multimodal reasoning that needs no large-scale, costly multimodal chain annotations.
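The self-calling mechanism described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: `call_model`, `decompose`, and `solve` are made-up names, and the model call is stubbed out. The key points it mirrors are that the main agent and its sub-agents share one model, and that each sub-call runs in a fresh, isolated context.

```python
# Hedged sketch of a self-calling agent: one shared model acts as both the
# main agent and its "virtual replicas" (sub-agents). All names are
# illustrative assumptions, not the paper's API.

def call_model(prompt, context=None):
    # Placeholder for a single call to the shared-parameter model; a real
    # system would run LLM/VLM inference here. We return canned text.
    return f"answer({prompt})"

def decompose(task):
    # The main agent would prompt the model to split the task into atomic
    # subtasks; here we fake a fixed three-way decomposition.
    return [f"{task}: subtask {i}" for i in range(3)]

def solve(task):
    # Main agent: decompose, self-call on each subtask with an isolated
    # (empty) context, then aggregate the sub-answers in a final call.
    subtasks = decompose(task)
    sub_answers = [call_model(st, context=None) for st in subtasks]
    return call_model(f"aggregate: {sub_answers}")
```

Because the sub-agents are the same model invoked recursively, no second set of parameters is trained; only the calling pattern changes.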

📝 Abstract
Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes the complex visual reasoning task into atomic subtasks and invokes its virtual replicas, i.e., parameter-sharing subagents, to solve them in isolated contexts. sCoT enjoys substantial training effectiveness and efficiency, as it requires no explicit interleaving between modalities. sCoT employs group-relative policy optimization to reinforce effective reasoning behavior and enhance optimization. Experiments on HR-Bench 4K show that sCoT improves the overall reasoning performance by up to 1.9% with ~75% fewer GPU hours compared to strong baseline approaches. Code is available at https://github.com/YWenxi/think-with-images-through-self-calling.
Problem

Research questions and friction points this paper is trying to address.

Optimizes multimodal Chain-of-Thought without interleaving modalities
Reduces reliance on scarce high-quality visual reasoning data
Enhances training efficiency and reasoning performance with fewer resources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-calling agent decomposes tasks into atomic subtasks
Parameter-sharing subagents solve subtasks in isolated context
Group-relative policy optimization reinforces effective reasoning behavior
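The group-relative optimization named in the last bullet centers on a simple idea: sample a group of responses per prompt and score each one against its own group rather than against a learned value function. A minimal sketch of that advantage computation, assuming scalar rewards per sampled response (function name and details are illustrative, not from the paper):

```python
# Hedged sketch of the group-relative advantage used in GRPO-style training:
# each response's reward is normalized by the mean and standard deviation of
# its sampling group, removing the need for a separate critic model.

def group_relative_advantages(rewards, eps=1e-8):
    # rewards: scalar rewards for G responses sampled from the same prompt.
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    # Responses better than the group average get positive advantage.
    return [(r - mean) / (std + eps) for r in rewards]
```

In a GRPO-style update, these advantages would weight the policy-gradient term for each response's tokens; correct reasoning traces (positive advantage) are reinforced, below-average ones suppressed.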
Wenxi Yang
University of Chinese Academy of Sciences
Yuzhong Zhao
University of Chinese Academy of Sciences
Fang Wan
University of Chinese Academy of Sciences
Qixiang Ye
University of Chinese Academy of Sciences, University of Maryland
Visual Object Detection · Image Processing