SpatialReasoner: Active Perception for Large-Scale 3D Scene Understanding

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models exhibit limited spatial reasoning in large-scale 3D environments (e.g., multi-floor houses), hindering efficient visual question answering (VQA). To address this, we propose SpatialReasoner, an active perception framework for house-level scene understanding built on a novel text-driven, tool-calling, hierarchical-exploration paradigm. We construct H²U3D, the first benchmark supporting multi-floor, large-scale 3D VQA, and design a coarse-to-fine active exploration strategy that drastically reduces the number of images that must be acquired. Hierarchical visual representations are annotated automatically and leveraged within a reinforcement learning framework that combines supervised cold-start initialization with adaptive exploration rewards, enabling chain-of-thought-driven answer generation. On H²U3D, SpatialReasoner achieves state-of-the-art performance while requiring only 3–4 images on average at inference, substantially outperforming prior methods that need 16 or more images as well as strong baselines including GPT-4o and Gemini-2.5-Pro.

📝 Abstract
Spatial reasoning in large-scale 3D environments remains challenging for current vision-language models, which are typically constrained to room-scale scenarios. We introduce H$^2$U3D (Holistic House Understanding in 3D), a 3D visual question answering dataset designed for house-scale scene understanding. H$^2$U3D features multi-floor environments spanning up to three floors and 10-20 rooms, covering more than 300 m$^2$. Through an automated annotation pipeline, it constructs hierarchical coarse-to-fine visual representations and generates diverse question-answer pairs with chain-of-thought annotations. We further propose SpatialReasoner, an active perception framework that autonomously invokes spatial tools to explore 3D scenes based on textual queries. SpatialReasoner is trained through a two-stage strategy: a supervised cold start followed by reinforcement learning with an adaptive exploration reward that promotes efficient exploration while discouraging redundant operations. Extensive experiments demonstrate that SpatialReasoner achieves state-of-the-art performance on H$^2$U3D, outperforming strong baselines including GPT-4o and Gemini-2.5-Pro. Notably, our method attains superior results while using only 3-4 images in total on average, compared to baselines requiring 16+ images, highlighting the effectiveness of our coarse-to-fine active exploration paradigm.
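The abstract describes an "adaptive exploration reward that promotes efficient exploration while discouraging redundant operations" but does not give its formula. A minimal sketch of what such a reward could look like, assuming a simple additive form; the weights, function name, and redundancy penalty are illustrative assumptions, not the paper's actual formulation:

```python
def exploration_reward(correct: bool,
                       tool_calls: list[str],
                       lambda_cost: float = 0.1) -> float:
    """Combine answer correctness with an exploration-efficiency penalty.

    Hypothetical sketch: reward a correct final answer, charge a small
    cost per tool invocation (so the policy answers with few images),
    and penalize redundant operations (repeated identical calls).
    """
    base = 1.0 if correct else 0.0
    # Redundant operations: repeated invocations of the same tool call.
    redundant = len(tool_calls) - len(set(tool_calls))
    # Per-call cost pushes toward the 3-4 images reported in the paper.
    cost = lambda_cost * len(tool_calls)
    return base - cost - 0.5 * redundant
```

Under this kind of shaping, a policy maximizing expected reward prefers short, non-repetitive tool sequences that still reach the correct answer.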
Problem

Research questions and friction points this paper is trying to address.

Addresses spatial reasoning in large-scale 3D house environments
Introduces a dataset for holistic 3D visual question answering
Proposes an active perception framework for efficient scene exploration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated annotation pipeline for hierarchical visual representations
Active perception framework invoking spatial tools autonomously
Two-stage training with supervised cold start and reinforcement learning
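The active perception loop sketched above (start coarse, invoke spatial tools only where the query demands detail) can be illustrated as follows. This is not the authors' code: the `Scene` structure, tool names, and string-matching dispatch are placeholder assumptions standing in for the learned tool-calling policy.

```python
from dataclasses import dataclass, field

@dataclass
class Scene:
    """Hypothetical house-scale scene: one overview plus per-floor room lists."""
    overview: str
    floors: dict[str, list[str]] = field(default_factory=dict)

def answer_question(scene: Scene, question: str) -> tuple[str, list[str]]:
    """Coarse-to-fine exploration sketch: return an answer stub and the
    views actually acquired. A trained policy would decide which tools
    to call; here simple keyword matching stands in for that decision."""
    acquired = [scene.overview]           # coarse: one house-level view
    for floor, rooms in scene.floors.items():
        if floor in question:             # descend only into relevant floors
            acquired.append(f"floor-view:{floor}")
            for room in rooms:
                if room in question:      # fine: fetch room views on demand
                    acquired.append(f"room-view:{room}")
    return f"answered using {len(acquired)} views", acquired
```

Because irrelevant floors and rooms are never imaged, the total number of acquired views stays small, mirroring the 3–4 images per query reported for SpatialReasoner versus the 16+ needed by passive baselines.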
Hongpei Zheng
University of Manchester
Shijie Li
Institute for Infocomm Research (I2R), A*STAR, Singapore
Yanran Li
University of Bedfordshire
Hujun Yin
School of Electrical and Electronic Engineering, The University of Manchester
Neural networks, image processing, face recognition, dimension reduction, time series