NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving

📅 2025-04-04
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) exhibit severe limitations in spatial understanding and reasoning for autonomous driving, compounded by the absence of a systematic, real-world evaluation benchmark. Method: We introduce NuScenes-SpatialQA, the first spatial understanding and reasoning benchmark for autonomous driving grounded in real-world 3D scene graphs, covering directional, distance, topological, and dynamic spatial relations. The approach features a fully automated pipeline for 3D scene graph construction and structured question generation, built on the NuScenes dataset for fidelity and interpretability; it further incorporates multi-granularity spatial relation annotation and a dual-track evaluation framework combining quantitative metrics with qualitative analysis. Contribution/Results: Experiments reveal persistent bottlenecks in quantitative spatial reasoning across state-of-the-art VLMs, including those with explicit spatial enhancements, delivering the first comprehensive cross-model assessment of spatial capability and clearly identifying critical gaps.
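The page does not include pipeline code; as a rough illustration of the first stage, the minimal Python sketch below builds a toy scene graph from ground-truth boxes. The function name `build_scene_graph`, the annotation schema, and the 10 m "near" threshold are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def build_scene_graph(annotations, near_threshold=10.0):
    """Nodes are annotated objects; edges carry pairwise spatial relations."""
    nodes = {i: ann for i, ann in enumerate(annotations)}
    edges = []
    for i, a in nodes.items():
        for j, b in nodes.items():
            if i >= j:
                continue  # one undirected edge per object pair
            delta = np.asarray(b["center"]) - np.asarray(a["center"])
            dist = float(np.linalg.norm(delta[:2]))  # bird's-eye-view distance
            edges.append({
                "src": i,
                "dst": j,
                "distance_m": round(dist, 2),   # quantitative relation
                "near": dist < near_threshold,  # topological relation
            })
    return nodes, edges

# Two hypothetical ground-truth annotations (category + 3D center in meters).
annotations = [
    {"category": "car", "center": (5.0, 2.0, 0.5)},
    {"category": "pedestrian", "center": (8.0, -1.0, 0.9)},
]
nodes, edges = build_scene_graph(annotations)
print(edges[0])  # {'src': 0, 'dst': 1, 'distance_m': 4.24, 'near': True}
```

A real pipeline would read calibrated 3D annotations (e.g., from the nuScenes devkit) and attach richer relations such as direction and relative motion; this sketch only shows the graph-building pattern.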

📝 Abstract
Recent advancements in Vision-Language Models (VLMs) have demonstrated strong potential for autonomous driving tasks. However, their spatial understanding and reasoning, key capabilities for autonomous driving, still exhibit significant limitations. Notably, none of the existing benchmarks systematically evaluate VLMs' spatial reasoning capabilities in driving scenarios. To fill this gap, we propose NuScenes-SpatialQA, the first large-scale ground-truth-based Question-Answer (QA) benchmark specifically designed to evaluate the spatial understanding and reasoning capabilities of VLMs in autonomous driving. Built upon the NuScenes dataset, the benchmark is constructed through an automated 3D scene graph generation pipeline and a QA generation pipeline. The benchmark systematically evaluates VLMs' performance in both spatial understanding and reasoning across multiple dimensions. Using this benchmark, we conduct extensive experiments on diverse VLMs, including both general and spatial-enhanced models, providing the first comprehensive evaluation of their spatial capabilities in autonomous driving. Surprisingly, the experimental results show that the spatial-enhanced VLM outperforms in qualitative QA but does not demonstrate competitiveness in quantitative QA. In general, VLMs still face considerable challenges in spatial understanding and reasoning.
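To make the quantitative QA track concrete, here is a hedged Python sketch of a tolerance-based scorer for distance questions. The ±25% relative tolerance and the regex-based number extraction are assumptions for illustration, not the paper's exact evaluation protocol.

```python
import re

def parse_meters(answer: str):
    """Pull the first numeric value out of a free-form model answer."""
    match = re.search(r"[-+]?\d*\.?\d+", answer)
    return float(match.group()) if match else None

def quantitative_correct(answer: str, gt_meters: float, tol: float = 0.25) -> bool:
    """Count an answer correct if within a relative tolerance of ground truth."""
    pred = parse_meters(answer)
    if pred is None:
        return False
    return abs(pred - gt_meters) <= tol * gt_meters

print(quantitative_correct("The car is roughly 11 meters away.", 11.7))  # True
print(quantitative_correct("About 30 m.", 11.7))                          # False
```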
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs' spatial reasoning in autonomous driving
Assessing spatial understanding in driving scenarios
Benchmarking VLMs' quantitative and qualitative spatial QA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated 3D scene graph generation pipeline
Large-scale ground-truth-based QA benchmark (see the QA-generation sketch after this list)
Multi-dimensional spatial capability evaluation
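As referenced above, the following minimal Python sketch shows how ground-truth spatial relations could be turned into paired quantitative and qualitative QA. The names `spatial_relation` and `make_qa`, and the quadrant thresholds, are hypothetical choices for illustration, not the paper's generation pipeline.

```python
import numpy as np

def spatial_relation(ego_xy, obj_xy, ego_heading):
    """Distance (m) and coarse egocentric direction of an object.

    ego_xy, obj_xy: 2D positions in the global frame (meters);
    ego_heading: ego yaw in radians. All inputs here are hypothetical.
    """
    delta = np.asarray(obj_xy, dtype=float) - np.asarray(ego_xy, dtype=float)
    distance = float(np.linalg.norm(delta))
    # Bearing relative to the ego heading, wrapped to [-pi, pi).
    bearing = np.arctan2(delta[1], delta[0]) - ego_heading
    bearing = (bearing + np.pi) % (2 * np.pi) - np.pi
    if abs(bearing) < np.pi / 4:
        direction = "in front of"
    elif abs(bearing) > 3 * np.pi / 4:
        direction = "behind"
    else:
        direction = "to the left of" if bearing > 0 else "to the right of"
    return distance, direction

def make_qa(obj_name, distance, direction):
    """Emit one quantitative and one qualitative QA pair per object."""
    return [
        (f"How far is the {obj_name} from the ego vehicle?",
         f"About {distance:.1f} meters."),
        (f"Where is the {obj_name} relative to the ego vehicle?",
         f"The {obj_name} is {direction} the ego vehicle."),
    ]

# Toy example: a pedestrian roughly 12 m ahead of an ego facing +x.
dist, drn = spatial_relation((0.0, 0.0), (10.0, 6.0), 0.0)
for question, answer in make_qa("pedestrian", dist, drn):
    print(question, "->", answer)
```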
👥 Authors

Kexin Tian
Texas A&M University
Autonomous Driving · Vision-Language Models

Jingrui Mao
Texas A&M University

Yunlong Zhang
Texas A&M University

Jiwan Jiang
University of Wisconsin-Madison

Yang Zhou
Texas A&M University

Zhengzhong Tu
Texas A&M University, Google Research, University of Texas at Austin
Agentic AI · Trustworthy AI · Embodied AI