Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception

📅 2025-10-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Small vision-language models (small VLMs) exhibit insufficient distance-sensitive perception, a critical capability for safety-critical applications such as autonomous driving. Method: We introduce DTPQA, the first traffic-scene visual question-answering benchmark with fine-grained distance annotations, designed to isolate pure perceptual performance at close range (up to 20 m) versus long range (30+ m) while eliminating the confounding effects of high-level reasoning. Contribution/Results: A comprehensive evaluation of mainstream small VLMs on DTPQA shows that even the best-performing model reaches only ~60% average accuracy, substantially below human performance (~85%); severe deficits appear in long-range object recognition and in left/right discrimination, both fundamental perceptual tasks. This work is the first to empirically expose structural weaknesses in small VLMs' distance-aware perception, establishing a rigorous evaluation paradigm and an empirical foundation for trustworthy in-vehicle visual understanding.

📝 Abstract
Vision-Language Models (VLMs) are becoming increasingly powerful, demonstrating strong performance on a variety of tasks that require both visual and textual understanding. Their strong generalisation abilities make them a promising component for automated driving systems, which must handle unexpected corner cases. However, to be trusted in such safety-critical applications, a model must first possess a reliable perception system. Moreover, since critical objects and agents in traffic scenes are often at a distance, we require systems that are not "shortsighted", i.e., systems with strong perception capabilities at both close (up to 20 meters) and long (30+ meters) range. With this in mind, we introduce Distance-Annotated Traffic Perception Question Answering (DTPQA), the first Visual Question Answering (VQA) benchmark focused solely on perception-based questions in traffic scenes, enriched with distance annotations. By excluding questions that require reasoning, we ensure that model performance reflects perception capabilities alone. Since automated driving hardware has limited processing power and cannot support large VLMs, our study centers on smaller VLMs. More specifically, we evaluate several state-of-the-art (SOTA) small VLMs on DTPQA and show that, despite the simplicity of the questions, these models significantly underperform compared to humans (~60% average accuracy for the best-performing small VLM versus ~85% human performance). However, it is important to note that the human sample size was relatively small, which imposes statistical limitations. We also identify specific perception tasks, such as distinguishing left from right, that remain particularly challenging for these models.
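The evaluation described in the abstract amounts to scoring VQA answers and stratifying accuracy by the annotated object distance. The sketch below illustrates that idea; it is a minimal illustration, not the paper's actual pipeline, and the record fields (`distance_m`, `prediction`, `answer`) are assumed names.

```python
from collections import defaultdict

def accuracy_by_distance(records, near_max=20.0, far_min=30.0):
    """Bucket VQA records into near (<= 20 m) and far (>= 30 m) ranges
    and compute exact-match accuracy per bucket.

    Each record is a dict with 'distance_m' (float), 'prediction' (str),
    and 'answer' (str). Mid-range records are skipped, mirroring the
    close-vs-long split used in the abstract.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        d = r["distance_m"]
        if d <= near_max:
            bucket = "near"
        elif d >= far_min:
            bucket = "far"
        else:
            continue  # between thresholds: excluded from the split
        total[bucket] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[bucket] += 1
    return {b: correct[b] / total[b] for b in total}

# Toy example with hypothetical annotations:
records = [
    {"distance_m": 12.0, "prediction": "left", "answer": "left"},
    {"distance_m": 18.5, "prediction": "right", "answer": "left"},
    {"distance_m": 35.0, "prediction": "car", "answer": "car"},
    {"distance_m": 42.0, "prediction": "truck", "answer": "car"},
]
print(accuracy_by_distance(records))  # {'near': 0.5, 'far': 0.5}
```

Exact string match is the simplest scoring rule; a real evaluation of free-form VLM output would likely need answer normalisation or multiple-choice prompting.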
Problem

Research questions and friction points this paper is trying to address.

Evaluating small VLMs' traffic perception accuracy at varying distances
Assessing perception-only capabilities through distance-annotated VQA benchmark
Identifying performance gaps between small VLMs and human perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distance-Annotated Traffic Perception Question Answering (DTPQA) benchmark
Perception-only question design that isolates perception from high-level reasoning
Evaluation of state-of-the-art small VLMs at both close and long range
Nikos Theodoridis
Department of Electronic and Computer Engineering, University of Limerick, Castletroy, Co. Limerick V94 T9PX, Ireland
Tim Brophy
University of Galway
Reenu Mohandas
Department of Electronic and Computer Engineering, University of Limerick, Castletroy, Co. Limerick V94 T9PX, Ireland
Ganesh Sistu
Principal Artificial Intelligence Architect, Valeo Ireland
Autonomous Driving, Machine Learning, Computer Vision, Deep Learning
Fiachra Collins
Valeo Vision Systems, Dunmore Road, Tuam, Co. Galway H54 Y276, Ireland
Anthony G. Scanlan
Department of Electronic and Computer Engineering, University of Limerick, Castletroy, Co. Limerick V94 T9PX, Ireland
Ciarán Eising
University of Limerick