Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning

📅 2026-06-10

🤖 AI Summary

Vision-language models struggle with spatial reasoning tasks that require active evidence acquisition and multi-step visual interaction. This work proposes PERIA, a tool-augmented visual agent built on two lightweight tool families: vision perception tools that expose textual, symbolic, and spatial evidence, and vision interaction tools that manipulate visual context, trace paths, and verify spatial relations. Training combines supervised tool-use trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO). On 13 benchmarks from 8 datasets, PERIA-8B improves over its Qwen3-8B backbone by 10.0% in-distribution and 4.4% out-of-distribution, outperforms previous state-of-the-art baselines of similar size by 7.0%–14.8%, and performs comparably to much larger models such as Qwen3-VL-235B-A22B-Thinking and GPT-5.

📝 Abstract

While recent vision-language models (VLMs) demonstrate strong multimodal understanding, they remain limited in spatial reasoning tasks that require active evidence acquisition and multi-step visual interaction. This limitation suggests that relying solely on implicit visual representations from vision encoders is insufficient for recovering fine-grained spatial evidence. We introduce PERception-Interaction-reason Agent (PERIA), a tool-augmented visual agent for spatial reasoning tasks across map reasoning, visual probing, and vision reconstruction. PERIA uses two lightweight tool families: vision perception tools for exposing textual, symbolic, and spatial evidence, and vision interaction tools for manipulating visual context, tracing paths, and verifying spatial relations. To train PERIA, we develop a unified recipe that combines supervised tool-use trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO) for effective multi-tool behavior. Experiments on 13 benchmarks from 8 datasets show that PERIA-8B improves over the Qwen3-8B backbone by 10.0% on in-distribution benchmarks and 4.4% on out-of-distribution benchmarks, while outperforming previous state-of-the-art baselines of similar size by 7.0%-14.8%. It also achieves performance comparable to much larger models such as Qwen3-VL-235B-A22B-Thinking and GPT-5, demonstrating the effectiveness of PERIA in enhancing spatial reasoning capabilities.
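To make the two tool families in the abstract concrete, here is a minimal Python sketch of how perception and interaction tools could be registered and called in sequence. It is an illustration under stated assumptions, not PERIA's implementation: the tool names (`read_labels`, `crop_region`), the dictionary-based state, and the scripted plan are all hypothetical, and a real agent would let the model choose each call from the trajectory so far.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical tool signature (not from the paper): a tool receives the
# mutable scene state and one string argument, and returns a text observation.
Tool = Callable[[dict, str], str]


def read_labels(state: dict, arg: str) -> str:
    """Perception tool (hypothetical): expose textual evidence in the scene."""
    return "labels: " + ", ".join(sorted(state["labels"]))


def crop_region(state: dict, arg: str) -> str:
    """Interaction tool (hypothetical): narrow the visual context to a region."""
    state["view"] = arg
    return f"view set to {arg}"


# Registry mapping each tool name to its family and implementation.
TOOLS: Dict[str, Tuple[str, Tool]] = {
    "read_labels": ("perception", read_labels),
    "crop_region": ("interaction", crop_region),
}


@dataclass
class Step:
    family: str
    tool: str
    observation: str


def run_episode(plan: List[Tuple[str, str]], state: dict) -> List[Step]:
    """Execute a fixed list of (tool, argument) calls and record observations.

    The plan is scripted here only to keep the sketch deterministic.
    """
    trajectory: List[Step] = []
    for name, arg in plan:
        family, fn = TOOLS[name]
        trajectory.append(Step(family, name, fn(state, arg)))
    return trajectory


state = {"view": "full", "labels": {"Station A", "Park"}}
trajectory = run_episode([("crop_region", "north-east"), ("read_labels", "")], state)
for step in trajectory:
    print(step.family, step.tool, "->", step.observation)
```

Keeping the family label in the registry means a recorded trajectory shows which steps gathered evidence and which changed the visual context, which is the distinction the abstract draws between the two families.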

Problem

Research questions and friction points this paper is trying to address.

spatial reasoning

vision-language models

visual interaction

evidence acquisition

tool-augmented agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

tool-augmented agent

spatial reasoning

vision-language models