Vision-language Models for Driver Monitoring Systems: A Driver Activity Description Dataset

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language models (VLMs) struggle to capture fine-grained distinctions between driver actions, which limits the reliability of driver monitoring systems. To address this, the work introduces a fine-grained natural language description version of the Drive&Act dataset and uses LLM-based scoring to evaluate the driver activity descriptions that VLMs generate. The fine-tuned VLM achieves an ACCR score of 76, compared with 66 for the zero-shot baseline, and generalizes well to actions in the Driver Monitoring Dataset (DMD). The results suggest that fine-tuning on richly described driver actions is an effective way to adapt VLMs to domain-specific settings such as driver monitoring.

📝 Abstract

Understanding subtle driver actions is essential for building reliable driver monitoring systems. Existing vision-language models (VLMs) are trained on general datasets and struggle to recognize fine distinctions in driver behaviors. This paper addresses this limitation by creating a detailed natural language version of the Drive&Act dataset. We evaluate three VLMs on our new benchmark using LLM-based scoring methods. Their performance shows that they cannot reliably generate accurate fine-grained driver activity descriptions. Based on the labeled Drive&Act dataset, we create a new Drive&Act description dataset containing fine-grained descriptions to train VLMs on driver activity understanding. Cross-dataset evaluation on the Driver Monitoring Dataset (DMD) shows that the VLM fine-tuned on our Drive&Act description dataset generalizes well to actions in DMD: it achieves an ACCR score of 76, outperforming the zero-shot VLM baseline, which scores 66. These findings demonstrate that adapting VLMs with richly described driver actions can significantly improve their ability to interpret driver behavior, while also highlighting the need for more diverse datasets to support broader generalization in future applications. Our Drive&Act description dataset and code will be publicly available on GitHub.
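
The abstract does not say how the LLM-based scoring or the ACCR score is computed, so the following is only a minimal sketch of the general evaluation pattern: compare each generated description with a reference description using a pluggable judge, then average the scores. The names `Judge`, `token_overlap_judge`, and `benchmark_score` are hypothetical, and the word-overlap judge is a toy stand-in for an LLM judge, not the paper's metric.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical interface: a judge maps (reference, candidate) to a score in [0, 100].
# In an LLM-based setup this callable would wrap a prompt to a judge model.
Judge = Callable[[str, str], float]


def token_overlap_judge(reference: str, candidate: str) -> float:
    """Toy stand-in for an LLM judge (an assumption, not the paper's method).

    Returns the percentage of distinct reference words that also
    appear in the candidate description.
    """
    ref_words = set(reference.lower().split())
    cand_words = set(candidate.lower().split())
    if not ref_words:
        return 0.0
    return 100.0 * len(ref_words & cand_words) / len(ref_words)


def benchmark_score(
    references: Sequence[str],
    candidates: Sequence[str],
    judge: Judge,
) -> float:
    """Average the per-sample judge scores over a benchmark."""
    if len(references) != len(candidates):
        raise ValueError("references and candidates must have the same length")
    if not references:
        raise ValueError("at least one sample is required")
    return mean(judge(ref, cand) for ref, cand in zip(references, candidates))


if __name__ == "__main__":
    refs = ["driver drinks from a bottle", "driver fastens the seat belt"]
    cands = ["driver drinks from a bottle", "driver looks at the phone"]
    print(benchmark_score(refs, cands, token_overlap_judge))
```

Keeping the judge as a plain callable means the same loop can score a zero-shot and a fine-tuned model with an identical rubric, which is what makes a comparison such as 76 versus 66 meaningful.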

Problem

Research questions and friction points this paper is trying to address.

vision-language models

driver monitoring systems

fine-grained activity description

driver behavior understanding

dataset limitation

Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language models

driver monitoring

fine-grained activity description