Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited multi-view spatial reasoning capability of vision-language models (VLMs) in multi-camera robotic settings. To this end, we introduce MV-RoboBench, the first systematic benchmark for evaluating multi-view spatial reasoning in embodied robotics. Built on 1.7K manually curated multi-view question-answer pairs, it covers eight subtasks across two categories: spatial understanding and robotic execution. Alongside the benchmark, we provide a standardized evaluation protocol and assess model variants enhanced with chain-of-thought-inspired prompting. Experiments reveal, for the first time, a strong positive correlation between VLMs' spatial intelligence and their robotic task performance; at the same time, scores on single-view benchmarks fail to predict multi-view performance, and state-of-the-art models fall well short of human accuracy. By establishing the first dedicated benchmark for multi-view spatial reasoning, this work fills a critical gap in embodied AI evaluation, providing both foundational resources and clear directions for future model development and benchmarking.
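As a purely illustrative rendering of such an evaluation protocol, the sketch below scores a model on multiple-choice multi-view QA items and reports per-subtask accuracy. The item schema (`views`, `question`, `choices`, `answer`, `subtask`) and the `predict` callable are assumptions made for this sketch, not the benchmark's actual API.

```python
from collections import defaultdict

def evaluate(items, predict):
    """Score a model on multiple-choice multi-view QA items.

    Each item is a dict with hypothetical fields (not the real schema):
      views    - list of image paths, one per camera view
      question - question text
      choices  - list of answer options
      answer   - index of the gold option
      subtask  - one of the eight subtask names
    `predict` maps (views, question, choices) -> chosen option index.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        pred = predict(item["views"], item["question"], item["choices"])
        total[item["subtask"]] += 1
        if pred == item["answer"]:
            correct[item["subtask"]] += 1
    # Per-subtask accuracy, plus an unweighted mean across subtasks.
    acc = {k: correct[k] / total[k] for k in total}
    acc["overall"] = sum(acc.values()) / len(acc)
    return acc
```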

📝 Abstract
Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for the recent Vision-Language-Action (VLA) models. Yet most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information underexplored. At the same time, multi-camera setups are increasingly standard in robotic platforms, as they provide complementary perspectives to mitigate occlusion and depth ambiguity. Whether VLMs can effectively leverage such multi-view inputs for robotic reasoning therefore remains an open question. To bridge this gap, we introduce MV-RoboBench, a benchmark specifically designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic manipulation. MV-RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided into two primary categories: spatial understanding and robotic execution. We evaluate a diverse set of existing VLMs, including both open-source and closed-source models, along with enhanced versions incorporating CoT-inspired techniques. The results show that state-of-the-art models remain far below human performance, underscoring the substantial challenges VLMs face in multi-view robotic perception. Additionally, our analysis uncovers two key findings: (i) spatial intelligence and robotic task execution are positively correlated in multi-view robotic scenarios; and (ii) strong performance on existing general-purpose single-view spatial understanding benchmarks does not reliably translate to success in the robotic spatial tasks assessed by our benchmark. We release MV-RoboBench as an open resource to foster progress in spatially grounded VLMs and VLAs, providing not only data but also a standardized evaluation protocol for multi-view embodied reasoning.
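To make finding (i) concrete, here is a minimal sketch of the kind of analysis it implies: given each model's scores on the two benchmark categories, compute a Pearson correlation across models. The score arrays below are made-up placeholders, not numbers from the paper.

```python
import numpy as np

# Hypothetical per-model accuracies on the two MV-RoboBench categories;
# real values would come from running the benchmark's evaluation protocol.
spatial   = np.array([0.41, 0.48, 0.55, 0.62, 0.37])  # spatial understanding
execution = np.array([0.35, 0.44, 0.52, 0.58, 0.31])  # robotic execution

# Pearson correlation across models; a value near +1 would reflect the
# reported positive coupling between spatial intelligence and execution.
r = np.corrcoef(spatial, execution)[0, 1]
print(f"Pearson r = {r:.3f}")
```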
Problem

Research questions and friction points this paper is trying to address.

Evaluating multi-view spatial reasoning in robotic vision-language models
Assessing VLM capabilities for integrating multi-camera perspectives
Benchmarking robotic scene understanding across complementary viewpoints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-view benchmark for spatial reasoning in robotics
Evaluates VLMs on 1.7k curated QA items spanning spatial understanding and robotic execution
Identifies correlation between spatial intelligence and robotic execution