STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes

📅 2025-08-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) show severe limitations in the spatiotemporal reasoning that dynamic traffic scenarios demand, a critical capability for autonomous driving, largely because they are trained on static image-text pairs. To address this, we introduce STRIDE-QA, the largest vision-language question-answering dataset for spatiotemporal reasoning in egocentric urban driving scenes. It is built from 100 hours of multi-sensor, real-world driving data collected in Tokyo and comprises 285K frames with 16M high-quality QA pairs. We propose three spatiotemporal joint-reasoning tasks that integrate 3D object detection, instance segmentation, and multi-object tracking to support both egocentric spatial localization and future motion prediction, together with a physics-aware automatic annotation pipeline and a dedicated evaluation benchmark. VLMs fine-tuned on STRIDE-QA achieve 55% success in spatial localization and 28% trajectory consistency, substantially outperforming general-purpose VLMs, which score near zero on these tasks.
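
To make the dataset description concrete, here is a minimal sketch of what one spatiotemporal QA record grounded in 3D detections and tracks could look like. The schema, field names (`frame_id`, `track_id`, `bbox_3d`, and so on), and the example question are illustrative assumptions, not the released STRIDE-QA format.

```python
# Hypothetical structure of one STRIDE-QA sample; field names are assumptions,
# not the released schema. Each QA pair is grounded in per-frame 3D annotations
# (3D bounding box, instance mask reference, multi-object track ID).
from dataclasses import dataclass, field

@dataclass
class GroundedObject:
    track_id: int        # persistent ID from multi-object tracking
    bbox_3d: tuple       # (x, y, z, l, w, h, yaw) in the ego frame, meters/radians
    mask_rle: str = ""   # run-length-encoded instance mask (illustrative)

@dataclass
class StrideQASample:
    frame_id: str        # e.g. "tokyo_route12_000345" (made-up identifier)
    timestamp_s: float   # time of the query frame
    question: str        # natural-language spatiotemporal question
    answer: str          # grounded answer (distance, direction, future motion)
    objects: list = field(default_factory=list)  # GroundedObject instances referenced by the QA

sample = StrideQASample(
    frame_id="tokyo_route12_000345",
    timestamp_s=1042.3,
    question="Where will the cyclist ahead be relative to the ego vehicle in 2 seconds?",
    answer="About 6 m ahead and 1.5 m to the right, moving away from the ego lane.",
    objects=[GroundedObject(track_id=17, bbox_3d=(8.2, -1.1, 0.0, 1.8, 0.6, 1.7, 0.05))],
)
```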

📝 Abstract
Vision-Language Models (VLMs) have been applied to autonomous driving to support decision-making in complex real-world scenarios. However, their training on static, web-sourced image-text pairs fundamentally limits the precise spatiotemporal reasoning required to understand and predict dynamic traffic scenes. We address this critical gap with STRIDE-QA, a large-scale visual question answering (VQA) dataset for physically grounded reasoning from an ego-centric perspective. Constructed from 100 hours of multi-sensor driving data in Tokyo, capturing diverse and challenging conditions, STRIDE-QA is the largest VQA dataset for spatiotemporal reasoning in urban driving, offering 16 million QA pairs over 285K frames. Grounded by dense, automatically generated annotations including 3D bounding boxes, segmentation masks, and multi-object tracks, the dataset uniquely supports both object-centric and ego-centric reasoning through three novel QA tasks that require spatial localization and temporal prediction. Our benchmarks demonstrate that existing VLMs struggle significantly, achieving near-zero scores on prediction consistency. In contrast, VLMs fine-tuned on STRIDE-QA exhibit dramatic performance gains, achieving 55% success in spatial localization and 28% consistency in future motion prediction, compared to near-zero scores from general-purpose VLMs. Therefore, STRIDE-QA establishes a comprehensive foundation for developing more reliable VLMs for safety-critical autonomous systems.
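
The abstract reports 55% success in spatial localization and 28% consistency in future motion prediction without defining either metric here. The sketch below is a minimal, assumed reading of both as threshold-based success rates over ego-frame positions; the paper's actual scoring protocol may differ.

```python
# Illustrative metrics only: the paper's definitions of "localization success"
# and "prediction consistency" are not given in this summary. Both are assumed
# here to be threshold-based success rates over predicted vs. ground-truth positions.
import math

def localization_success_rate(preds, gts, threshold_m=2.0):
    """Fraction of samples whose predicted (x, y) object position in the ego
    frame lies within `threshold_m` meters of the ground truth."""
    hits = sum(
        math.hypot(px - gx, py - gy) <= threshold_m
        for (px, py), (gx, gy) in zip(preds, gts)
    )
    return hits / len(preds)

def trajectory_consistency(pred_trajs, gt_trajs, threshold_m=2.0):
    """Fraction of predicted future trajectories whose every waypoint stays
    within `threshold_m` meters of the corresponding ground-truth waypoint."""
    consistent = sum(
        all(math.hypot(px - gx, py - gy) <= threshold_m
            for (px, py), (gx, gy) in zip(pred, gt))
        for pred, gt in zip(pred_trajs, gt_trajs)
    )
    return consistent / len(pred_trajs)

# Toy usage: one hit and one miss -> 0.5 localization success.
print(localization_success_rate([(5.0, 1.0), (10.0, 3.0)], [(5.5, 1.2), (14.0, 3.0)]))
```
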
Problem

Research questions and friction points this paper is trying to address.

VLMs' limited spatiotemporal reasoning over dynamic traffic scenes, rooted in training on static, web-sourced image-text pairs
Lack of a large-scale, ego-centric VQA dataset for physically grounded reasoning in urban driving
Need for dense, automatically generated annotations that ground spatial localization and temporal prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Largest VQA dataset for spatiotemporal reasoning in urban driving (16M QA pairs over 285K frames)
Dense auto-generated annotations (3D bounding boxes, segmentation masks, multi-object tracks) for physical grounding
Three novel QA tasks spanning object-centric and ego-centric spatial localization and future motion prediction (see the sketch after this list)
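
A minimal sketch of the templated QA-generation idea follows: pair a tracked object's current ego-frame state with its observed future state to produce a physically grounded question-answer pair. This is only an assumption about how such pairs could be derived from tracked 3D annotations; the paper's physics-aware pipeline is not specified in this summary and is certainly richer.

```python
# Assumed sketch of templated QA generation from tracked 3D annotations.
# Names, the track layout, and the question/answer templates are illustrative only.

def make_future_motion_qa(track, t_query, horizon_s=2.0):
    """Build one QA pair asking where a tracked object will be `horizon_s`
    seconds after `t_query`, using its later observed ego-frame position as
    the grounded answer.

    `track` is assumed to be a dict: {"category": str, "states": {t: (x, y)}}
    with positions in meters in the ego frame at each observation time t.
    """
    x0, _ = track["states"][t_query]
    x1, y1 = track["states"][t_query + horizon_s]
    question = (
        f"Where will the {track['category']} currently about {x0:.0f} m ahead "
        f"be in {horizon_s:.0f} seconds?"
    )
    answer = f"About {x1:.1f} m ahead with a lateral offset of {y1:.1f} m from the ego vehicle."
    return {"question": question, "answer": answer, "grounding": {"track": track, "t": t_query}}

# Toy usage with a synthetic track observed at 1-second intervals.
cyclist = {"category": "cyclist",
           "states": {10.0: (8.0, -1.0), 11.0: (9.5, -1.4), 12.0: (11.0, -1.8)}}
qa = make_future_motion_qa(cyclist, t_query=10.0)
print(qa["question"])
print(qa["answer"])
```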