How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) lack systematic evaluation of their temporal reasoning capabilities in continuous driving scenarios, and the impact of input configurations on their performance remains unclear. This work proposes the VENUSS framework, which constructs a temporal benchmark of over 2,600 real-world driving video scenes to conduct the first systematic sensitivity analysis of more than 25 state-of-the-art VLMs across multidimensional input configurations, including resolution, frame count, temporal interval, spatial layout, and presentation mode. Experimental results reveal that even the best-performing VLM achieves only 57% accuracy on this task, significantly below the human baseline of 65%. While models excel at static object recognition, they exhibit notable weaknesses in understanding vehicle dynamics and temporal relationships. The framework establishes a reproducible evaluation baseline for future research.
📝 Abstract
Vision-Language Models (VLMs) are increasingly proposed for autonomous driving tasks, yet their performance on sequential driving scenes remains poorly characterized, particularly regarding how input configurations affect their capabilities. We introduce VENUSS (VLM Evaluation oN Understanding Sequential Scenes), a framework for systematic sensitivity analysis of VLM performance on sequential driving scenes, establishing baselines for future research. Building upon existing datasets, VENUSS extracts temporal sequences from driving videos and generates structured evaluations across custom categories. By comparing 25+ existing VLMs across 2,600+ scenarios, we reveal that even top models achieve only 57% accuracy, falling short of human performance under similar constraints (65%) and exposing significant capability gaps. Our analysis shows that VLMs excel at static object detection but struggle to understand vehicle dynamics and temporal relations. VENUSS offers the first systematic sensitivity analysis of VLMs focused on how input image configurations (resolution, frame count, temporal intervals, spatial layouts, and presentation modes) affect performance on sequential driving scenes. Supplementary material available at https://V3NU55.github.io
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Sequential Driving Scenes
Temporal Relations
Vehicle Dynamics
Input Configuration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models
Sequential Driving Scenes
Sensitivity Analysis
Temporal Understanding
VENUSS
Roberto Brusnicki
Professorship of Autonomous Vehicle Systems, TUM School of Engineering and Design, Technical University of Munich, 85748 Garching, Germany; Munich Institute of Robotics and Machine Intelligence (MIRMI)
Mattia Piccinini
TUM Global Post-doc Researcher, Technical University of Munich
Autonomous Vehicles, Artificial Intelligence, Robotics, Trajectory Planning, Motion Control
Johannes Betz
Professor, Autonomous Vehicle Systems, Technical University of Munich (TUM)
Autonomous Systems, Motion Planning, Control, Robots