NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving

📅 2025-04-04
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) exhibit severe limitations in spatial understanding and reasoning for autonomous driving, compounded by the absence of a systematic, real-world evaluation benchmark. Method: We introduce NuScenes-SpatialQA, the first spatial understanding and reasoning benchmark for autonomous driving grounded in real-world 3D scene graphs, covering directional, distance, topological, and dynamic spatial relations. The approach features a fully automated pipeline for 3D scene graph construction and structured question generation, built on the NuScenes dataset for fidelity and interpretability; it further incorporates multi-granularity spatial relation annotation and a dual-track evaluation framework combining quantitative metrics with qualitative analysis. Contribution/Results: Experiments reveal persistent bottlenecks in quantitative spatial reasoning across state-of-the-art VLMs, including those with explicit spatial enhancements, delivering the first comprehensive cross-model assessment of spatial capability and clearly identifying critical gaps.
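The page does not include pipeline code; as a rough illustration of the first stage, the minimal Python sketch below builds a toy scene graph from ground-truth boxes. The function name `build_scene_graph`, the annotation schema, and the 10 m "near" threshold are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def build_scene_graph(annotations, near_threshold=10.0):
    """Nodes are annotated objects; edges carry pairwise spatial relations."""
    nodes = {i: ann for i, ann in enumerate(annotations)}
    edges = []
    for i, a in nodes.items():
        for j, b in nodes.items():
            if i >= j:
                continue  # one undirected edge per object pair
            delta = np.asarray(b["center"]) - np.asarray(a["center"])
            dist = float(np.linalg.norm(delta[:2]))  # bird's-eye-view distance
            edges.append({
                "src": i,
                "dst": j,
                "distance_m": round(dist, 2),   # quantitative relation
                "near": dist < near_threshold,  # topological relation
            })
    return nodes, edges

# Two hypothetical ground-truth annotations (category + 3D center in meters).
annotations = [
    {"category": "car", "center": (5.0, 2.0, 0.5)},
    {"category": "pedestrian", "center": (8.0, -1.0, 0.9)},
]
nodes, edges = build_scene_graph(annotations)
print(edges[0])  # {'src': 0, 'dst': 1, 'distance_m': 4.24, 'near': True}
```

A real pipeline would read calibrated 3D annotations (e.g., from the nuScenes devkit) and attach richer relations such as direction and relative motion; this sketch only shows the graph-building pattern.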

📝 Abstract
Recent advancements in Vision-Language Models (VLMs) have demonstrated strong potential for autonomous driving tasks. However, their spatial understanding and reasoning, key capabilities for autonomous driving, still exhibit significant limitations. Notably, none of the existing benchmarks systematically evaluate VLMs' spatial reasoning capabilities in driving scenarios. To fill this gap, we propose NuScenes-SpatialQA, the first large-scale ground-truth-based Question-Answer (QA) benchmark specifically designed to evaluate the spatial understanding and reasoning capabilities of VLMs in autonomous driving. Built upon the NuScenes dataset, the benchmark is constructed through an automated 3D scene graph generation pipeline and a QA generation pipeline. The benchmark systematically evaluates VLMs' performance in both spatial understanding and reasoning across multiple dimensions. Using this benchmark, we conduct extensive experiments on diverse VLMs, including both general and spatial-enhanced models, providing the first comprehensive evaluation of their spatial capabilities in autonomous driving. Surprisingly, the experimental results show that the spatial-enhanced VLM outperforms in qualitative QA but does not demonstrate competitiveness in quantitative QA. In general, VLMs still face considerable challenges in spatial understanding and reasoning.
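To make the quantitative QA track concrete, here is a hedged Python sketch of a tolerance-based scorer for distance questions. The ±25% relative tolerance and the regex-based number extraction are assumptions for illustration, not the paper's exact evaluation protocol.

```python
import re

def parse_meters(answer: str):
    """Pull the first numeric value out of a free-form model answer."""
    match = re.search(r"[-+]?\d*\.?\d+", answer)
    return float(match.group()) if match else None

def quantitative_correct(answer: str, gt_meters: float, tol: float = 0.25) -> bool:
    """Count an answer correct if within a relative tolerance of ground truth."""
    pred = parse_meters(answer)
    if pred is None:
        return False
    return abs(pred - gt_meters) <= tol * gt_meters

print(quantitative_correct("The car is roughly 11 meters away.", 11.7))  # True
print(quantitative_correct("About 30 m.", 11.7))                          # False
```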
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs' spatial reasoning in autonomous driving
Assessing spatial understanding in driving scenarios
Benchmarking VLMs' quantitative and qualitative spatial QA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated 3D scene graph generation pipeline
Large-scale ground-truth-based QA benchmark (see the QA-generation sketch after this list)
Multi-dimensional spatial capability evaluation
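As referenced above, the following minimal Python sketch shows how ground-truth spatial relations could be turned into paired quantitative and qualitative QA. The names `spatial_relation` and `make_qa`, and the quadrant thresholds, are hypothetical choices for illustration, not the paper's generation pipeline.

```python
import numpy as np

def spatial_relation(ego_xy, obj_xy, ego_heading):
    """Distance (m) and coarse egocentric direction of an object.

    ego_xy, obj_xy: 2D positions in the global frame (meters);
    ego_heading: ego yaw in radians. All inputs here are hypothetical.
    """
    delta = np.asarray(obj_xy, dtype=float) - np.asarray(ego_xy, dtype=float)
    distance = float(np.linalg.norm(delta))
    # Bearing relative to the ego heading, wrapped to [-pi, pi).
    bearing = np.arctan2(delta[1], delta[0]) - ego_heading
    bearing = (bearing + np.pi) % (2 * np.pi) - np.pi
    if abs(bearing) < np.pi / 4:
        direction = "in front of"
    elif abs(bearing) > 3 * np.pi / 4:
        direction = "behind"
    else:
        direction = "to the left of" if bearing > 0 else "to the right of"
    return distance, direction

def make_qa(obj_name, distance, direction):
    """Emit one quantitative and one qualitative QA pair per object."""
    return [
        (f"How far is the {obj_name} from the ego vehicle?",
         f"About {distance:.1f} meters."),
        (f"Where is the {obj_name} relative to the ego vehicle?",
         f"The {obj_name} is {direction} the ego vehicle."),
    ]

# Toy example: a pedestrian roughly 12 m ahead of an ego facing +x.
dist, drn = spatial_relation((0.0, 0.0), (10.0, 6.0), 0.0)
for question, answer in make_qa("pedestrian", dist, drn):
    print(question, "->", answer)
```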
👥 Authors

Kexin Tian
Texas A&M University
Autonomous Driving · Vision-Language Models

Jingrui Mao
Texas A&M University

Yunlong Zhang
Texas A&M University

Jiwan Jiang
University of Wisconsin-Madison

Yang Zhou
Texas A&M University

Zhengzhong Tu
Texas A&M University, Google Research, University of Texas at Austin
Agentic AI · Trustworthy AI · Embodied AI