🤖 AI Summary
To address the scarcity of high-quality multimodal reasoning data for interleaved multimodal Chain-of-Thought (iMCoT) visual reasoning, this paper proposes Self-Calling Chain-of-Thought (sCoT). sCoT decouples multimodal reasoning into a language-only chain via a self-calling mechanism: a main agent decomposes the task and orchestrates parameter-shared sub-agents that solve subtasks independently within isolated contexts. This shifts the paradigm from interleaved multimodal reasoning to a language-only self-calling architecture, and the method adopts Group Relative Policy Optimization (GRPO) to reinforce effective reasoning behavior. Experiments on HR-Bench 4K show that sCoT improves reasoning accuracy by up to 1.9% while using roughly 75% fewer GPU hours, outperforming strong baselines. The core contribution is a lightweight, efficient, and scalable paradigm for multimodal reasoning that works without large-scale, costly multimodal chain annotations.
📝 Abstract
Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes the complex visual reasoning task into atomic subtasks and invokes its virtual replicas, i.e., parameter-sharing sub-agents, to solve them in isolated contexts. sCoT enjoys substantial training effectiveness and efficiency, as it requires no explicit interleaving between modalities. To enhance optimization, sCoT employs group-relative policy optimization to reinforce effective reasoning behaviors. Experiments on HR-Bench 4K show that sCoT improves the overall reasoning performance by up to $1.9\%$ with $\sim 75\%$ fewer GPU hours compared to strong baseline approaches. Code is available at https://github.com/YWenxi/think-with-images-through-self-calling.
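The self-calling mechanism in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation (which is in the linked repository); `call_model`, `sub_agent`, and `main_agent` are hypothetical names, and the model call is a stub, but the structure mirrors the described flow: one main agent decomposes the task, each sub-agent is the same model invoked with a fresh, isolated context, and the main agent aggregates the partial answers.

```python
def call_model(context: list[str]) -> str:
    """Stand-in for a single LLM call on a context (stubbed for illustration)."""
    return f"answer({context[-1]})"


def sub_agent(subtask: str) -> str:
    # A "virtual replica" of the main agent: the same model (parameters are
    # shared), invoked with an isolated context holding only its subtask.
    isolated_context = [subtask]
    return call_model(isolated_context)


def main_agent(task: str) -> str:
    # The main agent decomposes the task into atomic subtasks (a fixed
    # three-way split here, purely for illustration), delegates each to a
    # sub-agent, then aggregates the partial answers into a final response.
    subtasks = [f"{task} :: step {i}" for i in range(3)]
    partial_answers = [sub_agent(s) for s in subtasks]
    return " | ".join(partial_answers)


print(main_agent("describe the flagged region"))
```

Because every sub-agent call is language-only and context-isolated, no interleaving of image tokens into the chain is needed, which is what the abstract credits for the training efficiency gains.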