How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) lack systematic evaluation of their temporal reasoning capabilities in continuous driving scenarios, and the impact of input configurations on their performance remains unclear. This work proposes the VENUSS framework, which constructs a temporal benchmark of over 2,600 real-world driving video scenes to conduct the first systematic sensitivity analysis of more than 25 state-of-the-art VLMs across multidimensional input configurations, including resolution, frame count, temporal interval, spatial layout, and presentation mode. Experimental results reveal that even the best-performing VLM achieves only 57% accuracy on this task, significantly below the human baseline of 65%. While models excel at static object recognition, they exhibit notable weaknesses in understanding vehicle dynamics and temporal relationships. The framework establishes a reproducible evaluation baseline for future research.
📝 Abstract
Vision-Language Models (VLMs) are increasingly proposed for autonomous driving tasks, yet their performance on sequential driving scenes remains poorly characterized, particularly regarding how input configurations affect their capabilities. We introduce VENUSS (VLM Evaluation oN Understanding Sequential Scenes), a framework for systematic sensitivity analysis of VLM performance on sequential driving scenes, establishing baselines for future research. Building upon existing datasets, VENUSS extracts temporal sequences from driving videos and generates structured evaluations across custom categories. By comparing 25+ existing VLMs across 2,600+ scenarios, we reveal that even top models achieve only 57% accuracy, falling short of human performance under similar constraints (65%) and exposing significant capability gaps. Our analysis shows that VLMs excel at static object detection but struggle to understand vehicle dynamics and temporal relations. VENUSS offers the first systematic sensitivity analysis of VLMs focused on how input image configurations (resolution, frame count, temporal intervals, spatial layouts, and presentation modes) affect performance on sequential driving scenes. Supplementary material available at https://V3NU55.github.io
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Sequential Driving Scenes
Temporal Relations
Vehicle Dynamics
Input Configuration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models
Sequential Driving Scenes
Sensitivity Analysis
Temporal Understanding
VENUSS
Roberto Brusnicki
Professorship of Autonomous Vehicle Systems, TUM School of Engineering and Design, Technical University of Munich, 85748 Garching, Germany; Munich Institute of Robotics and Machine Intelligence (MIRMI)
Mattia Piccinini
TUM Global Post-doc Researcher, Technical University of Munich
Autonomous Vehicles, Artificial Intelligence, Robotics, Trajectory Planning, Motion Control
Johannes Betz
Professor, Autonomous Vehicle Systems, Technical University of Munich (TUM)
Autonomous Systems, Motion Planning, Control, Robots