🤖 AI Summary
Existing vision-language models (VLMs) primarily rely on purely textual chain-of-thought reasoning or static, test-time visual manipulations, and lack end-to-end multimodal chain reasoning trained jointly across modalities. This work introduces the first reinforcement learning (RL)-based fine-tuning framework tailored for VLMs, enabling autonomous generation of interleaved text-and-image multimodal reasoning chains. During inference, the model strategically invokes executable Python-based visual editing tools to produce intermediate images; training optimizes end-to-end using only sparse reward signals based on final-answer correctness, without requiring supervision over intermediate steps. The approach integrates outcome-driven RL fine-tuning, executable visual tool integration, and structured multimodal chain modeling. Empirical evaluation on chart- and table-based visual question answering demonstrates substantial accuracy improvements. Notably, this is the first framework to realize controllable, interpretable “thinking in images” for VLMs, establishing a foundation for truly multimodal, tool-augmented reasoning.
📝 Abstract
Reinforcement Learning Finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, self-correction, and effective tool use. While recent works attempt to extend RFT to vision-language models (VLMs), these efforts largely produce text-only reasoning conditioned on static image inputs, falling short of true multimodal reasoning in the response. In contrast, test-time methods like Visual Sketchpad incorporate visual steps but lack training mechanisms. We introduce VTool-R1, the first framework that trains VLMs to generate multimodal chains of thought by interleaving text and intermediate visual reasoning steps. VTool-R1 integrates Python-based visual editing tools into the RFT process, enabling VLMs to learn when and how to generate visual reasoning steps that benefit final reasoning. Trained with outcome-based rewards tied to task accuracy, our approach elicits strategic visual tool use for reasoning without relying on process-based supervision. Experiments on structured visual question answering over charts and tables show that VTool-R1 enhances reasoning performance by teaching VLMs to "think with images" and generate multimodal chains of thought with tools.
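The outcome-based, sparse reward described above can be illustrated with a minimal sketch: the reward is computed only from the final answer, while intermediate steps (text or tool-produced images) receive no direct supervision. The `Step` structure and `outcome_reward` function below are hypothetical illustrations, not code from the VTool-R1 implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Step:
    """One step of an interleaved multimodal chain of thought."""
    text: str                        # textual reasoning for this step
    tool_call: Optional[str] = None  # e.g. Python code that edits the image
    image: Optional[bytes] = None    # intermediate image produced by the tool

def outcome_reward(chain: List[Step], final_answer: str, gold: str) -> float:
    """Sparse, outcome-based reward: 1.0 iff the final answer is correct.

    The chain of intermediate steps is not scored directly, so the policy
    must learn when and how visual tool use improves the final outcome.
    """
    return 1.0 if final_answer.strip().lower() == gold.strip().lower() else 0.0

# Toy rollout over a chart question: crop a region, then answer from the crop.
chain = [
    Step(text="The legend is cluttered; crop the top-right region first.",
         tool_call="crop(img, box=(400, 0, 640, 120))",
         image=b"<cropped-image-bytes>"),
    Step(text="In the cropped view, the 2021 bar is tallest."),
]
print(outcome_reward(chain, final_answer="2021", gold="2021"))  # 1.0
print(outcome_reward(chain, final_answer="2020", gold="2021"))  # 0.0
```

Because the reward is binary and applied only at the end of the rollout, tool calls that do not change the final answer earn nothing extra, which is what pushes the policy toward strategic rather than indiscriminate tool use.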