SITE: towards Spatial Intelligence Thorough Evaluation

πŸ“… 2025-05-08
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the lack of systematic evaluation of spatial intelligence (SI) in large vision-language models (VLMs). To this end, the authors introduce SITE, the first standardized benchmark to comprehensively cover single-image, multi-image, and video modalities across scales from figural to environmental. Grounded in three foundational cognitive-science taxonomies, SITE integrates 31 existing datasets and introduces two new task categories, viewpoint taking and dynamic scene reasoning, both implemented as multiple-choice visual question answering. This enables holistic assessment across modalities, scales, static/dynamic conditions, and intrinsic/extrinsic spatial dimensions. Experiments reveal that state-of-the-art VLMs underperform humans by 32.7% on fundamental SI tasks such as spatial orientation; moreover, SI capability correlates strongly with embodied AI performance (r = 0.81). SITE thus provides a quantifiable, interpretable, and extensible evaluation of spatial reasoning in VLMs.
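Since SITE poses every task as multiple-choice visual question answering, model performance reduces to accuracy over answer options, optionally broken down by modality. The sketch below is a rough illustration of that scoring scheme, not the paper's released evaluation code; the record fields (`modality`, `prediction`, `answer`) and the function name are assumptions.

```python
from collections import defaultdict

def score_mcq_benchmark(records):
    """Compute overall and per-modality accuracy for multiple-choice VQA.

    Each record is a dict with a 'modality' (e.g. 'single-image',
    'multi-image', 'video'), the model's 'prediction', and the gold
    'answer', both as option letters. Field names are illustrative.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["modality"]] += 1
        # Case-insensitive match on the chosen option letter.
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[r["modality"]] += 1
    per_modality = {m: correct[m] / total[m] for m in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_modality
```

A per-modality breakdown like this is what makes the benchmark's claim of "holistic assessment across modalities" directly measurable from one standardized answer format.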

πŸ“ Abstract
Spatial intelligence (SI) is a cognitive ability encompassing the visualization, manipulation, and reasoning about spatial relationships, underpinning disciplines from neuroscience to robotics. We introduce SITE, a benchmark dataset towards SI Thorough Evaluation in a standardized format of multi-choice visual question-answering, designed to assess large vision-language models' spatial intelligence across diverse visual modalities (single-image, multi-image, and video) and SI factors (figural to environmental scales, spatial visualization and orientation, intrinsic and extrinsic, static and dynamic). Our approach to curating the benchmark combines a bottom-up survey of 31 existing datasets and a top-down strategy drawing upon three classification systems in cognitive science, which prompt us to design two novel types of tasks about view-taking and dynamic scenes. Extensive experiments reveal that leading models fall behind human experts, especially in spatial orientation, a fundamental SI factor. Moreover, we demonstrate a positive correlation between a model's spatial reasoning proficiency and its performance on an embodied AI task.
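The reported link between spatial reasoning and embodied AI performance (r = 0.81 in the summary above) is a standard Pearson correlation across models: each model contributes one SITE score and one embodied-task score. A minimal sketch of that computation, assuming plain per-model score lists rather than the paper's actual data:

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length score lists,
    e.g. per-model SITE accuracy vs. embodied-task success rate."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value near +1 means models that rank higher on SITE also tend to rank higher on the embodied task, which is the evidence behind the claim that SI capability transfers to embodied settings.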
Problem

Research questions and friction points this paper is trying to address.

Evaluating spatial intelligence in vision-language models across diverse visual modalities.
Assessing models' spatial reasoning on figural to environmental scales and dynamic scenes.
Identifying gaps between AI and human performance in spatial orientation tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Standardized multi-choice visual question-answering benchmark
Combines bottom-up survey and top-down cognitive classifications
Introduces novel tasks for view-taking and dynamic scenes