SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language models (VLMs) struggle to maintain coherent spatial understanding in 3D environments and often hold inconsistent spatial beliefs across multi-turn interactions. To study this limitation, this work introduces SpatialAct, the first benchmark specifically designed for action-conditioned spatial reasoning. Built in a simulated environment, it comprises three parts: Multi-turn Interactive Refinement, its decomposed counterpart Single-step Error Detection and Fix, and five fundamental spatial ability tasks that diagnose the underlying causes of model failures. Experiments show that VLMs can perform well on isolated spatial reasoning tasks but substantially underperform humans in multi-turn feedback settings, which points to a lack of robust spatial state tracking under action-induced environment changes.

📝 Abstract

Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reasoning into actions in everyday 3D environments. Although recent vision-language models (VLMs) have shown promising performance on observation-conditioned spatial perception and reasoning tasks, it remains unclear whether they can build coherent spatial understanding, act upon it, and refine their actions through multi-turn feedback. To study this problem, we introduce SpatialAct, a simulator-grounded benchmark for probing action-conditioned spatial reasoning in 3D scenes. Starting from the most challenging setting, Multi-turn Interactive Refinement, we further design its decomposed counterpart, Single-step Error Detection and Fix, together with five fundamental spatial ability tasks to diagnose the underlying causes of model failures. Experiments reveal a clear reasoning-to-action gap: current VLMs can perform well on isolated spatial reasoning tasks, but struggle to maintain coherent spatial beliefs and produce reliable actions during multi-turn feedback, substantially underperforming humans. These results suggest that current VLM agents still lack robust spatial state tracking under action-induced environment changes, even when low-level control is abstracted away.

Problem

Research questions and friction points this paper is trying to address.

spatial reasoning

vision-language models

3D scenes

action-conditioned reasoning

multi-turn interaction

Innovation

Methods, ideas, or system contributions that make the work stand out.

SpatialAct

action-conditioned spatial reasoning

vision-language models