VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use

📅 2025-05-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing vision-language models (VLMs) rely primarily on purely textual chain-of-thought reasoning or on static, test-time visual manipulations, and lack multimodal reasoning chains trained jointly across modalities. This work introduces an RL-based fine-tuning framework, the first tailored to training VLMs to autonomously generate interleaved text-and-image reasoning chains. During inference, the model strategically invokes executable Python-based visual editing tools to produce intermediate images, and is optimized end-to-end solely via sparse reward signals based on final-answer correctness, without requiring supervision over intermediate steps. The approach combines outcome-driven RL fine-tuning, executable visual tool integration, and structured multimodal chain modeling. Empirical evaluation on chart- and table-based visual question answering demonstrates substantial accuracy improvements. Notably, this is the first framework to realize controllable, interpretable “thinking in images” for VLMs, establishing a foundation for truly multimodal, tool-augmented reasoning.

📝 Abstract
Reinforcement Learning Finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, self-correction, and effective tool use. While recent works attempt to extend RFT to vision-language models (VLMs), these efforts largely produce text-only reasoning conditioned on static image inputs, falling short of true multimodal reasoning in the response. In contrast, test-time methods like Visual Sketchpad incorporate visual steps but lack training mechanisms. We introduce VTool-R1, the first framework that trains VLMs to generate multimodal chains of thought by interleaving text and intermediate visual reasoning steps. VTool-R1 integrates Python-based visual editing tools into the RFT process, enabling VLMs to learn when and how to generate visual reasoning steps that benefit final reasoning. Trained with outcome-based rewards tied to task accuracy, our approach elicits strategic visual tool use for reasoning without relying on process-based supervision. Experiments on structured visual question answering over charts and tables show that VTool-R1 enhances reasoning performance by teaching VLMs to "think with images" and generate multimodal chains of thought with tools.
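The outcome-based reward described in the abstract can be sketched as a simple binary check on the final answer, with no credit for intermediate steps. This is an illustrative sketch only; the function name and the normalization rule are assumptions, not the paper's exact reward implementation:

```python
def outcome_reward(predicted_answer: str, gold_answer: str) -> float:
    """Binary outcome-based reward: 1.0 if the final answer matches the
    gold answer after light normalization, 0.0 otherwise. Intermediate
    reasoning steps (text or tool calls) receive no direct supervision."""
    def normalize(s: str) -> str:
        return s.strip().lower()
    return 1.0 if normalize(predicted_answer) == normalize(gold_answer) else 0.0
```

In practice a reward like this would be computed per rollout and fed to a policy-gradient RL objective; the sparsity is deliberate, so the model must discover for itself when a visual editing step improves the final answer.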
Problem

Research questions and friction points this paper is trying to address.

Extends reinforcement learning to vision-language models for multimodal reasoning
Trains VLMs to interleave text and visual steps in reasoning
Enhances accuracy in visual question answering with tool use
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trains VLMs with multimodal chains of thought
Integrates Python-based visual editing tools
Uses outcome-based rewards for strategic visual tool use
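The tool-integration idea above can be sketched as a minimal interleaved reasoning loop. Everything here is hypothetical for illustration: the policy interface, the `tool_call`/`answer` action shapes, and the choice of cropping as the editing tool are assumptions, and the image is simplified to a 2D pixel grid rather than a real raster:

```python
def crop_region(image, box):
    """Example visual editing tool: crop a rectangular region of interest
    (e.g. one column of a chart) from an image stored as a 2D pixel grid.
    box = (left, top, right, bottom), with right/bottom exclusive."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]

def reasoning_loop(policy, image, question, max_steps=4):
    """Illustrative interleaved text-and-image loop (hypothetical interface):
    at each step the policy either requests a tool call, whose output image
    is appended to the context, or emits a final textual answer."""
    context = [image, question]
    for _ in range(max_steps):
        step = policy(context)
        if step["type"] == "tool_call":
            context.append(crop_region(image, step["box"]))
        else:
            return step["text"]
    return None  # no answer within the step budget
```

Under outcome-based RL, the policy is rewarded only on the final answer, so it must learn when a tool call (producing an intermediate image) actually helps.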