GeoDrive-Bench: Benchmarking Region-Specific Multimodal Reasoning in Autonomous Driving

📅 2026-06-01
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the limited adaptability of existing vision-language models to region-specific traffic regulations and their inability to reason about cultural and geographical differences in autonomous driving. To tackle this, the authors introduce GeoDrive-Bench, a benchmark comprising 5,053 human-verified multiple-choice question-answer pairs from six countries, targeting perception, prediction, planning, and region-aware reasoning tasks without explicit country labels. They propose a region-specific knowledge distillation approach to embed local traffic rules into the model's internal representations and conduct systematic evaluations across nine state-of-the-art vision-language models. The results reveal substantial performance disparities across regions and demonstrate that the proposed method effectively enhances cross-regional driving reasoning capabilities, offering the first systematic validation and improvement of cultural and regulatory sensitivity in vision-language models for autonomous driving.
📝 Abstract
Vision-language models (VLMs) for autonomous driving have shown promising performance, but their ability to handle region-specific traffic rules remains underexplored, raising uncertainties about their deployment across diverse global settings. We therefore introduce GeoDrive-Bench, a novel benchmark that enables the systematic investigation of VLMs' geo-culturally grounded driving reasoning. We curated 5,053 human-validated multiple-choice QA pairs across six countries covering diverse driving cultures. Specifically, we emphasize four driving tasks: perception, prediction, planning, and region reasoning. Each question requires models to infer the correct driving behavior from visual evidence and local traffic conventions without explicit country labels. Beyond evaluation, we further design a distillation algorithm that injects region-specific traffic-rule knowledge into the internal representations of VLMs, enabling models to better align visual scene understanding with local driving policies. Experiments on nine state-of-the-art VLMs show substantial performance variations across geo-driving cultures for each task, while our proposed baseline models exhibit improved geo-cultural reasoning across regions. These results suggest that current VLMs still lack robust region-aware driving intelligence and highlight GeoDrive-Bench as a diagnostic and training-oriented testbed for deployable autonomous driving foundation models.
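The per-region, per-task comparison described in the abstract can be illustrated with a minimal scoring sketch. This is not the authors' evaluation code: the record fields (`id`, `country`, `task`, `answer`), the placeholder country names, and both helper functions are hypothetical, chosen only to show how accuracy on multiple-choice answers might be grouped by country and task. The country label is used here for scoring only and would not be shown to the model.

```python
from collections import defaultdict

def accuracy_by_group(records, predictions):
    """Return {(country, task): accuracy} for multiple-choice QA records.

    `records` is a list of dicts with hypothetical keys
    "id", "country", "task", "answer"; `predictions` maps id -> chosen option.
    A missing prediction counts as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        key = (rec["country"], rec["task"])
        total[key] += 1
        if predictions.get(rec["id"]) == rec["answer"]:
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}

def regional_gap(acc, task):
    """Spread between the best and worst country accuracy for one task."""
    scores = [value for (_, t), value in acc.items() if t == task]
    return max(scores) - min(scores) if scores else 0.0

# Toy data with placeholder countries; not taken from the benchmark.
records = [
    {"id": "q1", "country": "country_a", "task": "planning", "answer": "B"},
    {"id": "q2", "country": "country_a", "task": "planning", "answer": "C"},
    {"id": "q3", "country": "country_b", "task": "planning", "answer": "A"},
    {"id": "q4", "country": "country_b", "task": "perception", "answer": "D"},
]
predictions = {"q1": "B", "q2": "A", "q3": "A", "q4": "D"}

acc = accuracy_by_group(records, predictions)
planning_gap = regional_gap(acc, "planning")
```

A large `regional_gap` for a task is the kind of cross-region disparity the abstract reports, although the paper's actual metrics and grouping may differ.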
Problem

Research questions and friction points this paper is trying to address.

autonomous driving
region-specific traffic rules
vision-language models
geo-cultural reasoning
multimodal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

region-specific reasoning
vision-language models
autonomous driving benchmark
geo-cultural knowledge distillation
multimodal driving intelligence