Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language models struggle with complex spatial reasoning because they observe scenes passively and are trained with sparse reward signals. Inspired by pigeons' ability to construct cognitive maps, this work proposes an active agent framework built around a dynamic cognitive map, which stores object positions and orientations and serves as persistent memory for the scene layout. To make intermediate reasoning precise and verifiable, the framework introduces Spatial Assertion Codes (SAC): Python expressions that describe spatial relationships programmatically and, when checked against the cognitive map, provide dense reward signals. Trained with supervised fine-tuning followed by reinforcement fine-tuning, the method achieves 80.5% overall accuracy on the MindCube benchmark and outperforms the best current method by 29.5 percentage points (a relative improvement of 53.2%) on the Rotation subset.
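The two Rotation figures quoted above (an absolute gain of 29.5 points and a relative gain of 53.2%) together imply approximate accuracies for both methods on that subset. The short calculation below is only a back-of-the-envelope check; the resulting baseline and method accuracies are derived from the rounded numbers in this summary and are not values reported in the text.

```python
# Back-of-the-envelope check of the Rotation-subset numbers quoted above.
# Both outputs are implied estimates derived from rounded figures,
# not accuracies reported in the summary itself.
gain_points = 29.5      # absolute improvement, in accuracy points
relative_gain = 0.532   # relative improvement (53.2%)

# relative_gain = gain_points / baseline  =>  baseline = gain_points / relative_gain
baseline_rotation = gain_points / relative_gain
ours_rotation = baseline_rotation + gain_points

print(f"implied baseline accuracy: about {baseline_rotation:.1f}%")
print(f"implied method accuracy:   about {ours_rotation:.1f}%")
```

Under these assumptions the previous best method sits at roughly 55% on Rotation and the proposed method at roughly 85%, which is consistent with the subset being described as challenging.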

📝 Abstract

Enabling Vision-Language Models (VLMs) to perform spatial reasoning remains challenging. Existing approaches treat VLMs as passive observers, which limits their usefulness in real-world applications. Moreover, reinforcement learning methods rely on sparse rewards, limiting their effectiveness for complex reasoning tasks. Inspired by the way pigeons build and exploit cognitive maps for navigation, we propose a novel agentic pipeline for spatial reasoning. First, we introduce a dynamic cognitive map that parameterizes the scene layout as object positions and orientations and serves as persistent memory that is updated with new observations. Second, we propose Spatial Assertion Codes (SAC), Python expressions that programmatically describe spatial relationships. Evaluated against the dynamic cognitive map, SAC enables verification of intermediate reasoning steps and thereby provides dense reward signals. We optimize the model via supervised and reinforcement fine-tuning. Experiments on the MindCube benchmark demonstrate state-of-the-art performance with 80.5% overall accuracy, outperforming the best current method by 29.5 accuracy points (a relative improvement of 53.2%) on the challenging Rotation subset. Our code and data are open-sourced at https://github.com/dw-dengwei/active-spatial-reasoning.git.
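To make the interplay between the cognitive map and SAC more concrete, here is a minimal sketch of how it might look in code. It is not the authors' implementation: the class `CognitiveMap`, the predicates `left_of` and `in_front_of`, the 2D pose convention, and the "fraction of assertions that hold" reward are all hypothetical choices made for illustration. The abstract states only that the map stores object positions and orientations and that SAC are Python expressions verified against it.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Pose:
    x: float
    y: float
    heading_deg: float = 0.0  # assumed convention: 0 = +x axis, counterclockwise


@dataclass
class CognitiveMap:
    """Hypothetical persistent memory: object name -> 2D pose."""
    poses: dict = field(default_factory=dict)

    def update(self, name, x, y, heading_deg=0.0):
        # Insert or overwrite a pose when a new observation arrives.
        self.poses[name] = Pose(x, y, heading_deg)

    def _in_viewer_frame(self, name, viewer):
        # Express the object's offset in the viewer's (forward, leftward) axes.
        v, o = self.poses[viewer], self.poses[name]
        h = math.radians(v.heading_deg)
        dx, dy = o.x - v.x, o.y - v.y
        forward = dx * math.cos(h) + dy * math.sin(h)
        leftward = -dx * math.sin(h) + dy * math.cos(h)
        return forward, leftward

    def left_of(self, a, b, viewer):
        return self._in_viewer_frame(a, viewer)[1] > self._in_viewer_frame(b, viewer)[1]

    def in_front_of(self, a, viewer):
        return self._in_viewer_frame(a, viewer)[0] > 0.0


def check_assertion(expr, cmap):
    """Evaluate one assertion string against the map; errors count as False.

    Emptying __builtins__ only limits the visible names; it is not a real sandbox.
    """
    namespace = {"left_of": cmap.left_of, "in_front_of": cmap.in_front_of}
    try:
        return bool(eval(expr, {"__builtins__": {}}, namespace))
    except Exception:
        return False


def dense_reward(assertions, cmap):
    """Illustrative dense reward: fraction of intermediate assertions that hold."""
    if not assertions:
        return 0.0
    return sum(check_assertion(a, cmap) for a in assertions) / len(assertions)


cmap = CognitiveMap()
cmap.update("agent", 0.0, 0.0, heading_deg=90.0)  # agent faces +y
cmap.update("chair", -1.0, 2.0)
cmap.update("table", 1.0, 2.0)

steps = [
    "left_of('chair', 'table', 'agent')",   # holds while facing +y
    "in_front_of('table', 'agent')",        # holds while facing +y
    "left_of('table', 'chair', 'agent')",   # does not hold while facing +y
]
reward_before = dense_reward(steps, cmap)

# The agent turns around; only its pose in the map changes.
cmap.update("agent", 0.0, 0.0, heading_deg=270.0)
reward_after = dense_reward(steps, cmap)

print(reward_before, reward_after)
```

In this toy setup each reasoning step receives its own pass or fail signal rather than a single end-of-episode reward, and turning the agent around flips which assertions hold even though no object has moved. That kind of viewpoint change is presumably what a rotation-focused subset stresses.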

Problem

Research questions and friction points this paper is trying to address.

spatial reasoning

Vision-Language Models

reinforcement learning

cognitive map

dense reward

Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic cognitive map

Spatial Assertion Codes

agentic vision-language models