SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

2026-06-08
AI Summary
Existing benchmarks struggle to evaluate the interactive spatial understanding capabilities of multimodal agents in realistic settings. To address this gap, this work proposes SpatialWorld, a unified benchmark that integrates eight heterogeneous simulation backends and 760 human-annotated tasks, requiring agents to actively explore and complete complex real-world tasks from a first-person, partially observable perspective. SpatialWorld introduces the first simulator-agnostic protocol, enabling cross-domain, long-horizon, and active-perception-based spatial reasoning evaluation, and provides human-validated initial states, reference trajectories, and an automated final-state verification mechanism. Evaluation of 15 state-of-the-art agents reveals that even the strongest model, GPT-5, achieves only a 17.4% success rate, highlighting significant bottlenecks in current systems' active exploration and planning efficiency.
Abstract
Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.
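The evaluation loop described in the abstract (observe, emit a text action, then check the final state with a verifier) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Backend` interface, the `Task` fields, the `"stop"` action, and the toy corridor environment are hypothetical and are not the SpatialWorld API. String observations stand in for the egocentric images a real backend would return.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class Backend(Protocol):
    """Hypothetical simulator-agnostic interface (not the paper's API)."""

    def reset(self) -> str: ...              # initial egocentric observation
    def step(self, action: str) -> str: ...  # apply a text action, observe
    def state(self) -> dict: ...             # final state for the verifier


@dataclass
class Task:
    backend: Backend
    verifier: Callable[[dict], bool]  # terminal-state check
    max_steps: int = 50               # step budget (assumed)


def run_episode(task: Task, agent: Callable[[str], str]) -> bool:
    """Run one task; success is decided only by the terminal-state verifier."""
    obs = task.backend.reset()
    for _ in range(task.max_steps):
        action = agent(obs)
        if action == "stop":  # assumed termination action
            break
        obs = task.backend.step(action)
    return task.verifier(task.backend.state())


def task_success_rate(tasks: list, agent: Callable[[str], str]) -> float:
    """Task success rate (TSR) as a percentage of verified successes."""
    if not tasks:
        return 0.0
    successes = sum(run_episode(t, agent) for t in tasks)
    return 100.0 * successes / len(tasks)


class ToyCorridor:
    """Toy 1-D backend: the agent only sees whether it stands on the goal."""

    def __init__(self, goal: int):
        self.goal = goal
        self.pos = 0

    def _observe(self) -> str:
        return "goal" if self.pos == self.goal else "corridor"

    def reset(self) -> str:
        self.pos = 0
        return self._observe()

    def step(self, action: str) -> str:
        if action == "forward":
            self.pos += 1
        elif action == "back":
            self.pos -= 1
        return self._observe()

    def state(self) -> dict:
        return {"pos": self.pos}


def forward_agent(obs: str) -> str:
    """Trivial policy: walk forward until the goal is observed."""
    return "stop" if obs == "goal" else "forward"


# Four toy tasks; the two distant goals exceed the 10-step budget.
tasks = [
    Task(ToyCorridor(g), lambda s, g=g: s["pos"] == g, max_steps=10)
    for g in (2, 3, 100, 200)
]
print(task_success_rate(tasks, forward_agent))  # 2 of 4 succeed -> 50.0
```

Keeping the verifier separate from the backend means the same loop can score any simulator that exposes these three calls, which is the general idea behind a simulator-agnostic protocol; the actual interface used by the benchmark may differ.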
Problem

Research questions and friction points this paper is trying to address.

spatial reasoning
multimodal agents
interactive evaluation
real-world tasks
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

interactive spatial reasoning
multimodal agents
simulator-agnostic benchmark
egocentric vision
real-world task evaluation
Hongcheng Gao
University of Chinese Academy of Sciences
Natural Language Processing, Large Language Models, Vision Language Models
Hailong Qu
Chongqing University
Jingyi Tang
Peking University
Jiahao Wang
Xi'an Jiaotong University
Multimodal AI, Deep Learning, Electrical Engineering
Zihao Huang
Beijing Institute of Technology
Hengkang Qiao
Chongqing University
Shihong Huang
Professor of Information Systems, Carnegie Mellon University
Software Engineering, Brain Computer Interaction, Human Computer Interaction, Self-adaptive Systems
Junming Yang
Southeast University
Yi Li
Associate Professor of Sensing, Imaging & Tomography, Tsinghua University
multiphase flow measurement, electrical tomography, sensing data fusion, machine learning
Hongyixuan Yuan
Chongqing University
Wenjie Li
Shanghai Jiao Tong University
Bohan Zeng
PhD student, Peking University
Data-Centric AI, Computer Vision, Diffusion Model, 3D
Wenbo Li
The Chinese University of Hong Kong
Computer Vision, Deep Learning
Bo Wang
Beijing Institute of Technology
Jianhui Liu
PhD student, The University of Hong Kong
Robotic, 3D scene understanding, 6D Pose Estimation
Olive Huang
Peking University
Haoyang Huang
JD Explore Academy (present) | StepFun | Microsoft Research
Multimodal & Multilingual Foundation Model
Wentao Zhang
Institute of Physics, Chinese Academy of Sciences
photoemission, superconductivity, cuprate, htsc, time-resolved
Guoqing Huang
Professor of Civil Engineering, Chongqing University
Wind Engineering, Structural Dynamics, Random Vibration
Nan Duan
JD.Com (now) | StepFun | Microsoft Research
NLP, Artificial General Intelligence
Yinpeng Dong
Tsinghua University
Machine Learning, Deep Learning, AI Safety