Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) excel at single-step image understanding but struggle with multi-step visual reasoning tasks that require iterative tool selection, invocation, and coordination. To address this limitation, we propose VISTA-Gym, a scalable training environment specifically designed for reinforcement learning of vision agents. It establishes a unified multimodal reasoning framework by (1) defining standardized visual tool interfaces (e.g., object localization, scene parsing), (2) introducing verifiable feedback mechanisms to guide agent behavior, and (3) optimizing the interaction loop via multi-turn trajectory sampling and end-to-end reinforcement learning. Our vision agent model, VISTA-R1-8B, trained within this framework, achieves improvements of 9.51% to 18.72% over same-scale state-of-the-art models across 11 challenging visual question answering benchmarks. This marks a systematic advance in tool-driven, multi-step visual reasoning for vision agents.

📝 Abstract
While recent vision-language models (VLMs) demonstrate strong image understanding, their ability to "think with images", i.e., to reason through multi-step visual interactions, remains limited. We introduce VISTA-Gym, a scalable training environment for incentivizing tool-integrated visual reasoning capabilities in VLMs. VISTA-Gym unifies diverse real-world multimodal reasoning tasks (7 tasks from 13 datasets in total) with a standardized interface for visual tools (e.g., grounding, parsing), executable interaction loops, verifiable feedback signals, and efficient trajectory logging, enabling visual agentic reinforcement learning at scale. While recent VLMs exhibit strong text-only reasoning, both proprietary and open-source models still struggle with tool selection, invocation, and coordination. With VISTA-Gym, we train VISTA-R1 to interleave tool-use with agentic reasoning via multi-turn trajectory sampling and end-to-end reinforcement learning. Extensive experiments across 11 public reasoning-intensive VQA benchmarks show that VISTA-R1-8B outperforms state-of-the-art baselines with similar sizes by 9.51%-18.72%, demonstrating VISTA-Gym as an effective training ground to unlock the tool-integrated reasoning capabilities for VLMs.
Problem

Research questions and friction points this paper is trying to address.

Enhancing multi-step visual reasoning in vision-language models
Addressing tool selection and coordination challenges in VLMs
Developing scalable training for tool-integrated agentic reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

VISTA-Gym environment standardizes multimodal reasoning tasks
Agentic reinforcement learning trains tool-use via trajectory sampling
End-to-end training improves tool selection and coordination
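The interaction loop described above (standardized tool interfaces, verifiable feedback, multi-turn trajectory sampling) can be sketched as a toy episode rollout. This is a minimal illustration, not the paper's actual code: the names `ToolEnv`, `Step`, `run_episode`, and the scripted policy are all hypothetical stand-ins for the trained VLM and the VISTA-Gym API.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    tool: str
    args: dict
    observation: str
    reward: float

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)

    @property
    def ret(self):
        # Episode return: sum of per-step rewards, the RL training signal.
        return sum(s.reward for s in self.steps)

class ToolEnv:
    """Toy environment with a standardized tool interface and verifiable
    feedback, loosely in the spirit of VISTA-Gym's executable loop."""
    TOOLS = ("ground", "parse", "answer")

    def __init__(self, target="cat"):
        self.target = target
        self.grounded = False

    def call(self, tool, **args):
        # Verifiable feedback: full reward only for a correct final answer,
        # a small shaping reward for a useful intermediate tool call.
        if tool == "ground":
            self.grounded = True
            return f"box for '{args.get('query', '')}'", 0.1
        if tool == "parse":
            return "scene graph: [cat, sofa]", 0.0
        if tool == "answer":
            correct = self.grounded and args.get("text") == self.target
            return "done", 1.0 if correct else 0.0
        raise ValueError(f"unknown tool: {tool}")

def run_episode(policy, env, max_turns=4):
    """Multi-turn trajectory sampling: the policy interleaves tool calls
    with reasoning until it answers or runs out of turns."""
    traj = Trajectory()
    for _ in range(max_turns):
        tool, args = policy(traj)
        obs, reward = env.call(tool, **args)
        traj.steps.append(Step(tool, args, obs, reward))
        if tool == "answer":
            break
    return traj

def scripted_policy(traj):
    # Ground first, then answer; stands in for the trained VLM policy.
    if not traj.steps:
        return "ground", {"query": "cat"}
    return "answer", {"text": "cat"}

traj = run_episode(scripted_policy, ToolEnv())
print(len(traj.steps), round(traj.ret, 2))  # → 2 1.1
```

In an actual training run, logged trajectories like this would feed an end-to-end RL objective (e.g., a policy-gradient update) so the model learns which tool to invoke at each turn.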