🤖 AI Summary
To address the scarcity of high-quality multimodal reasoning data for interleaved multimodal Chain-of-Thought (iMCoT) visual reasoning, this paper proposes Self-Calling Chain-of-Thought (sCoT). sCoT decouples multimodal reasoning into a language-only chain via a self-calling mechanism: a main agent decomposes the task and orchestrates parameter-shared sub-agents that solve subtasks independently within isolated contexts. This shifts the paradigm from interleaved multimodal reasoning to a language-only self-calling architecture, and the method adopts Group Relative Policy Optimization (GRPO) to reinforce effective reasoning behavior. Experiments on HR-Bench 4K show that sCoT improves reasoning accuracy by up to 1.9% while using roughly 75% fewer GPU hours, outperforming strong baselines. The core contribution is a lightweight, efficient, and scalable paradigm for multimodal reasoning that works without large-scale, costly multimodal chain annotations.
📝 Abstract
Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes the complex visual reasoning task into atomic subtasks and invokes its virtual replicas, i.e., parameter-sharing sub-agents, to solve them in isolated contexts. sCoT enjoys substantial training effectiveness and efficiency, as it requires no explicit interleaving between modalities. To enhance optimization, sCoT employs group-relative policy optimization to reinforce effective reasoning behaviors. Experiments on HR-Bench 4K show that sCoT improves the overall reasoning performance by up to $1.9\%$ with $\sim 75\%$ fewer GPU hours compared to strong baseline approaches. Code is available at https://github.com/YWenxi/think-with-images-through-self-calling.
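The self-calling mechanism in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation (which is in the linked repository); `call_model`, `sub_agent`, and `main_agent` are hypothetical names, and the model call is a stub, but the structure mirrors the described flow: one main agent decomposes the task, each sub-agent is the same model invoked with a fresh, isolated context, and the main agent aggregates the partial answers.

```python
def call_model(context: list[str]) -> str:
    """Stand-in for a single LLM call on a context (stubbed for illustration)."""
    return f"answer({context[-1]})"


def sub_agent(subtask: str) -> str:
    # A "virtual replica" of the main agent: the same model (parameters are
    # shared), invoked with an isolated context holding only its subtask.
    isolated_context = [subtask]
    return call_model(isolated_context)


def main_agent(task: str) -> str:
    # The main agent decomposes the task into atomic subtasks (a fixed
    # three-way split here, purely for illustration), delegates each to a
    # sub-agent, then aggregates the partial answers into a final response.
    subtasks = [f"{task} :: step {i}" for i in range(3)]
    partial_answers = [sub_agent(s) for s in subtasks]
    return " | ".join(partial_answers)


print(main_agent("describe the flagged region"))
```

Because every sub-agent call is language-only and context-isolated, no interleaving of image tokens into the chain is needed, which is what the abstract credits for the training efficiency gains.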