SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding

📅 2025-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work examines whether multimodal large language models (MLLMs) possess 3D spatial perception and understanding. It introduces VGBench, a benchmark for visual geometry perception tasks such as camera pose and motion estimation, and SpatialScore, a 28K-sample benchmark that integrates VGBench with relevant data from 11 other existing datasets and includes a curated challenging subset, SpatialScore-Hard. It also proposes SpatialAgent, a multi-agent system with 9 specialized tools that supports both Plan-Execute and ReAct reasoning paradigms. Extensive evaluations reveal persistent challenges in spatial reasoning for existing MLLMs and demonstrate the effectiveness of SpatialAgent.

📝 Abstract

Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities? Concretely, we make the following contributions in this paper: (i) we introduce VGBench, a benchmark specifically designed to assess MLLMs for visual geometry perception, e.g., camera pose and motion estimation; (ii) we propose SpatialScore, the most comprehensive and diverse multimodal spatial understanding benchmark to date, integrating VGBench with relevant data from 11 other existing datasets. This benchmark comprises 28K samples across various spatial understanding tasks, modalities, and QA formats, along with a carefully curated challenging subset, SpatialScore-Hard; (iii) we develop SpatialAgent, a novel multi-agent system incorporating 9 specialized tools for spatial understanding, supporting both Plan-Execute and ReAct reasoning paradigms; (iv) we conduct extensive evaluations to reveal persistent challenges in spatial reasoning while demonstrating the effectiveness of SpatialAgent. We believe SpatialScore will offer valuable insights and serve as a rigorous benchmark for the next evolution of MLLMs.
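The two reasoning paradigms named in the abstract differ mainly in when tool calls are decided. The sketch below illustrates that difference only; it is not SpatialAgent's implementation. The tools (`estimate_depth`, `detect_object`), their outputs, and `toy_policy` (a stand-in for the MLLM) are hypothetical placeholders, not the nine tools from the paper.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in tools; the paper's actual tools are not reproduced here.
def estimate_depth(arg: str) -> str:
    return f"depth({arg})=2.5m"

def detect_object(arg: str) -> str:
    return f"bbox({arg})=[10,20,50,80]"

TOOLS: Dict[str, Callable[[str], str]] = {
    "depth": estimate_depth,
    "detect": detect_object,
}

Step = Tuple[str, str]  # (tool name, tool argument)

def plan_execute(plan: List[Step]) -> List[str]:
    """Plan-Execute: the whole tool sequence is fixed up front, then run in order."""
    return [TOOLS[name](arg) for name, arg in plan]

def react(policy: Callable[[List[str]], Optional[Step]],
          max_steps: int = 5) -> List[str]:
    """ReAct: the next action is chosen after seeing all prior observations."""
    observations: List[str] = []
    for _ in range(max_steps):
        action = policy(observations)
        if action is None:  # the policy decides it has enough information
            break
        name, arg = action
        observations.append(TOOLS[name](arg))
    return observations

def toy_policy(observations: List[str]) -> Optional[Step]:
    """Stand-in for the MLLM: detect first, then query depth, then stop."""
    if not observations:
        return ("detect", "chair")
    if len(observations) == 1:
        return ("depth", "chair")
    return None
```

Calling `plan_execute([("detect", "chair"), ("depth", "chair")])` and `react(toy_policy)` yields the same observations here; the difference is that the ReAct loop could change course after each observation, while the plan is committed before any tool runs.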

Problem

Research questions and friction points this paper is trying to address.

Assessing MLLMs' 3D spatial perception abilities

Creating a unified benchmark for multimodal spatial understanding

Developing tools to enhance spatial reasoning in MLLMs
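A unified benchmark that spans several task types is usually reported per task as well as overall. The sketch below shows one simple way to aggregate such scores, assuming exact-match scoring and hypothetical sample fields (`task`, `answer`, `prediction`); the paper's actual metrics and data schema are not described on this page.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping

def per_task_accuracy(samples: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Return exact-match accuracy per task plus an 'overall' entry.

    Field names are illustrative assumptions, not the benchmark's schema.
    """
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for s in samples:
        totals[s["task"]] += 1
        # Case- and whitespace-insensitive exact match.
        if s["prediction"].strip().lower() == s["answer"].strip().lower():
            correct[s["task"]] += 1
    scores = {task: correct[task] / totals[task] for task in totals}
    n = sum(totals.values())
    if n:  # avoid dividing by zero on an empty sample list
        scores["overall"] = sum(correct.values()) / n
    return scores
```

Note that the `overall` value is weighted by sample count, so tasks with more samples dominate it; averaging the per-task scores instead would weight every task equally.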

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces VGBench for visual geometry assessment

Proposes comprehensive SpatialScore benchmark

Develops SpatialAgent multi-agent system
