Embodied Scene Understanding for Vision Language Models via MetaVQA

📅 2025-01-15
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Current vision-language models (VLMs) lack standardized benchmarks for spatial understanding and embodied decision-making, hindering their reliable deployment in traffic scenarios. To address this, we propose MetaVQA: the first automated visual question answering (VQA) generation framework integrating bird's-eye-view (BEV) ground-truth annotations and Set-of-Mark prompting, enabling a closed-loop, simulation-driven spatial reasoning benchmark built upon nuScenes and Waymo. Methodologically, MetaVQA unifies object-centric embodied instruction modeling, top-down spatial annotation, VLM fine-tuning, and driving simulation in a tightly coupled closed loop. Experiments demonstrate that MetaVQA significantly improves VLMs' spatial reasoning accuracy in safety-critical scenarios (+12.7%) and promotes the emergence of safe driving behaviors. Moreover, it generalizes strongly from simulation to real-world observations, bridging the sim-to-real gap in autonomous driving perception and reasoning.
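
To make the Set-of-Mark prompting step concrete, here is a minimal sketch of how numbered marks could be overlaid on a camera frame so a VLM can refer to objects by label. The function name, the box format, and the drawing choices are illustrative assumptions, not the actual MetaVQA implementation.

```python
# Hedged sketch: overlay Set-of-Mark labels on an image so a VLM can
# reference objects by number. Illustrative only, not MetaVQA's code.
from PIL import Image, ImageDraw

def draw_set_of_marks(image: Image.Image, boxes: list[tuple]) -> Image.Image:
    """Draw a numbered mark and bounding box for each object."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for mark_id, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
        draw.text((x0 + 4, y0 + 4), str(mark_id), fill="red")
    return annotated

# Example usage with a blank placeholder frame and two made-up boxes
frame = Image.new("RGB", (640, 360), "gray")
marked = draw_set_of_marks(frame, [(100, 120, 220, 200), (400, 150, 480, 260)])
marked.save("som_frame.png")
```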

📝 Abstract
Vision Language Models (VLMs) demonstrate significant potential as embodied AI agents for various mobility applications. However, a standardized, closed-loop benchmark for evaluating their spatial reasoning and sequential decision-making capabilities is lacking. To address this, we present MetaVQA: a comprehensive benchmark designed to assess and enhance VLMs' understanding of spatial relationships and scene dynamics through Visual Question Answering (VQA) and closed-loop simulations. MetaVQA leverages Set-of-Mark prompting and top-down view ground-truth annotations from nuScenes and Waymo datasets to automatically generate extensive question-answer pairs based on diverse real-world traffic scenarios, ensuring object-centric and context-rich instructions. Our experiments show that fine-tuning VLMs with the MetaVQA dataset significantly improves their spatial reasoning and embodied scene comprehension in safety-critical simulations, evident not only in improved VQA accuracies but also in emerging safety-aware driving maneuvers. In addition, the learning demonstrates strong transferability from simulation to real-world observation. Code and data will be publicly available at https://metadriverse.github.io/metavqa.
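
The automated QA generation described in the abstract pairs Set-of-Mark object labels with top-down ground truth. A minimal sketch of that idea follows; the data types, the coordinate convention (x forward, y left in the ego frame, as in nuScenes), and the question template are assumptions for illustration, not MetaVQA's actual schema.

```python
# Hedged sketch: turn BEV ground-truth annotations into object-centric
# question-answer pairs. Types and templates are hypothetical.
from dataclasses import dataclass

@dataclass
class ObjectAnnotation:
    mark_id: int          # Set-of-Mark label overlaid on the image
    category: str         # e.g. "car", "pedestrian"
    position: tuple       # (x, y) in the ego vehicle's BEV frame

def spatial_relation(obj: ObjectAnnotation) -> str:
    """Coarse spatial relation of an object relative to the ego vehicle."""
    x, y = obj.position
    longitudinal = "front" if x > 0 else "back"   # x forward
    lateral = "left" if y > 0 else "right"        # y left
    return f"{longitudinal}-{lateral}"

def generate_qa_pairs(objects: list[ObjectAnnotation]) -> list[tuple[str, str]]:
    """Generate one spatial question-answer pair per annotated object."""
    qa_pairs = []
    for obj in objects:
        question = (f"Where is the {obj.category} labeled <{obj.mark_id}> "
                    f"relative to the ego vehicle?")
        qa_pairs.append((question, spatial_relation(obj)))
    return qa_pairs

# Example usage: two annotated objects from a single frame
scene = [ObjectAnnotation(1, "car", (12.0, -3.5)),
         ObjectAnnotation(2, "pedestrian", (5.0, 2.0))]
for q, a in generate_qa_pairs(scene):
    print(q, "->", a)
```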
Problem

Research questions and friction points this paper is trying to address.

Vision Language Models
Spatial Understanding
Decision Making
Innovation

Methods, ideas, or system contributions that make the work stand out.

MetaVQA
Vision Language Models (VLMs)
Spatial Understanding Enhancement