Collision-Aware Vision-Language Learning for End-to-End Driving with Multimodal Infraction Datasets

📅 2026-03-26
📈 Citations: 0
✹ Influential: 0
📄 PDF
đŸ€– AI Summary
This work addresses the performance limitations of end-to-end autonomous driving systems in closed-loop evaluation, which often suffer from high collision violation rates due to the absence of explicit collision modeling and multimodal supervisory signals. To this end, we propose VLAAD, a video–language-augmented anomaly detector that introduces, for the first time, a collision-aware vision–language representation learning framework combined with multiple instance learning for temporally localized collision prediction. We further contribute two multimodal collision datasets—CARLA-Collide and Real-Collide—spanning diverse scenarios to support training and evaluation. As a lightweight, plug-and-play module, VLAAD yields a 14.12% relative improvement in the TransFuser++ driving score in CARLA closed-loop simulation and achieves 23.3% higher AUC than a multi-billion-parameter vision-language baseline in real-world open-loop testing, despite using only 0.6B parameters.

📝 Abstract
High infraction rates remain the primary bottleneck for end-to-end (E2E) autonomous driving, as evidenced by the low driving scores on the CARLA Leaderboard. Despite collision-related infractions being the dominant failure mode in closed-loop evaluations, collision-aware representation learning has received limited attention. To address this gap, we first develop a Video-Language-Augmented Anomaly Detector (VLAAD), leveraging a Multiple Instance Learning (MIL) formulation to obtain stable, temporally localized collision signals for proactive prediction. To transition these capabilities into closed-loop simulations, we must overcome the limitations of existing simulator datasets, which lack multimodality and are frequently restricted to simple intersection scenarios. Therefore, we introduce CARLA-Collide, a large-scale multimodal dataset capturing realistic collision events across highly diverse road networks. Trained on this diverse simulator data, VLAAD serves as a collision-aware plug-in module that can be seamlessly integrated into existing E2E driving models. By integrating our module into a pretrained TransFuser++ agent, we demonstrate a 14.12% relative increase in driving score with minimal fine-tuning. Beyond closed-loop evaluation, we further assess the generalization capability of VLAAD in an open-loop setting using real-world driving data. To support this analysis, we introduce Real-Collide, a multimodal dataset of diverse dashcam videos paired with semantically rich annotations for collision detection and prediction. On this benchmark, despite containing only 0.6B parameters, VLAAD outperforms a multi-billion-parameter vision-language model, achieving a 23.3% improvement in AUC.
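The Multiple Instance Learning (MIL) formulation mentioned in the abstract—learning from video-level collision labels while localizing the collision in time—is commonly realized as a ranking loss over per-segment anomaly scores (in the style of deep MIL anomaly detection). A minimal sketch of such an objective, with hypothetical function and variable names not taken from the paper:

```python
import numpy as np

def mil_ranking_loss(pos_bag, neg_bag, margin=1.0):
    """Hinge ranking loss between a collision (positive) video bag and a
    normal (negative) bag of segment-level anomaly scores.

    Only video-level labels are required: the max over segments pushes the
    single highest-scoring segment of the collision video above every
    segment of the normal video, which temporally localizes the event.
    """
    return max(0.0, margin - np.max(pos_bag) + np.max(neg_bag))

# Hypothetical per-segment scores for two short videos.
pos = np.array([0.1, 0.2, 0.9, 0.3])   # collision video: one high-scoring segment
neg = np.array([0.1, 0.2, 0.15, 0.1])  # normal video: uniformly low scores

loss = mil_ranking_loss(pos, neg)       # 1.0 - 0.9 + 0.2 = 0.3
collision_segment = int(np.argmax(pos)) # segment 2 is the predicted collision time
```

The `argmax` over the positive bag is what turns a weak video-level label into a temporally localized collision signal; the actual VLAAD objective and score network are described in the paper itself.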
Problem

Research questions and friction points this paper is trying to address.

collision-aware learning
end-to-end driving
multimodal dataset
autonomous driving infractions
vision-language learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

collision-aware learning
vision-language modeling
multimodal dataset
end-to-end driving
anomaly detection
Alex Koran
McGill University, Montréal, Canada; Mila - Québec AI Institute, Montréal, Canada
Dimitrios Sinodinos
McGill University, Montréal, Canada; Mila - Québec AI Institute, Montréal, Canada
Hadi Hojjati
AI Scientist, McGill University & Mila - Québec AI Institute
Machine Learning · Multimodal Learning · Anomaly Detection · Self-Supervised Learning
Takuya Nanri
Nissan Motor Corporation, Yokohama, Japan
Fangge Chen
Nissan Motor Corporation, Yokohama, Japan
Narges Armanfard
McGill University, Montréal, Canada; Mila - Québec AI Institute, Montréal, Canada