LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving

📅 2025-08-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) for autonomous driving exhibit insufficient spatial perception and fine-grained scene understanding in complex environments, limiting the interpretability of driving behavior and the quality of human–machine interaction. To address this, we propose LMAD, an end-to-end vision-language framework tailored for autonomous driving. LMAD jointly processes multi-view camera inputs and scene-reasoning textual prompts, introduces a preliminary scene interaction mechanism, and incorporates lightweight expert adapters, significantly enhancing holistic environmental perception and spatial cognition. Its task-specific architecture integrates seamlessly with planning-oriented driving systems, balancing compatibility and extensibility. Evaluated on the DriveLM and nuScenes-QA benchmarks, LMAD substantially outperforms existing VLMs in driving-intention reasoning and scene question answering, setting a new state of the art for interpretable autonomous driving.
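At a high level, the pipeline described in the summary can be sketched as follows. This is a minimal, illustrative PyTorch sketch, not the paper's released code: the names `SceneInteraction` and `LMADSketch` are invented here, the preliminary scene interaction is assumed to be a cross-view self-attention layer, and `vision_encoder` / `llm` stand in for a frozen image backbone and a HuggingFace-style language model that accepts `inputs_embeds`.

```python
import torch
import torch.nn as nn

class SceneInteraction(nn.Module):
    """Cross-view self-attention: tokens from all camera views exchange
    information before reaching the language model (an assumed stand-in
    for the paper's preliminary scene interaction)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_views * tokens_per_view, dim)
        fused, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + fused)

class LMADSketch(nn.Module):
    """Encode each camera view, fuse views, then condition a language
    model on [visual tokens; prompt tokens]."""
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module, dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder  # frozen image backbone (assumption)
        self.interaction = SceneInteraction(dim)
        self.llm = llm  # frozen LM taking inputs_embeds (assumption)

    def forward(self, images: torch.Tensor, prompt_embeds: torch.Tensor):
        # images: (batch, num_views, C, H, W); prompt_embeds: (batch, seq, dim)
        b, v = images.shape[:2]
        tokens = self.vision_encoder(images.flatten(0, 1))   # (b*v, t, dim)
        tokens = tokens.reshape(b, -1, tokens.shape[-1])     # concat views per sample
        tokens = self.interaction(tokens)                    # cross-view fusion
        return self.llm(inputs_embeds=torch.cat([tokens, prompt_embeds], dim=1))
```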

📝 Abstract
Large vision-language models (VLMs) have shown promising capabilities in scene understanding, enhancing the explainability of driving behaviors and interactivity with users. Existing methods primarily fine-tune VLMs on on-board multi-view images and scene reasoning text, but this approach often lacks the holistic and nuanced scene recognition and powerful spatial awareness required for autonomous driving, especially in complex situations. To address this gap, we propose a novel vision-language framework tailored for autonomous driving, called LMAD. Our framework emulates modern end-to-end driving paradigms by incorporating comprehensive scene understanding and a task-specialized structure with VLMs. In particular, we introduce preliminary scene interaction and specialized expert adapters within the same driving task structure, which better align VLMs with autonomous driving scenarios. Furthermore, our approach is designed to be fully compatible with existing VLMs while seamlessly integrating with planning-oriented driving systems. Extensive experiments on the DriveLM and nuScenes-QA datasets demonstrate that LMAD significantly boosts the performance of existing VLMs on driving reasoning tasks, setting a new standard in explainable autonomous driving.
Problem

Research questions and friction points this paper is trying to address.

Insufficient holistic scene understanding limits explainable autonomous driving
Weak spatial awareness of VLMs in complex driving scenarios
Poor alignment between vision-language models and planning-oriented driving systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end vision-language model for driving
Specialized expert adapters align the VLM with driving tasks (see the sketch after this list)
Fully compatible with existing VLMs and planning-oriented driving systems
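One common way to realize a lightweight expert adapter is a LoRA-style low-rank residual around frozen linear layers. The sketch below assumes that formulation; the paper's adapters may differ, and `ExpertAdapter` is an illustrative name, not the authors' API.

```python
import torch
import torch.nn as nn

class ExpertAdapter(nn.Module):
    """LoRA-style adapter: a trainable low-rank residual (up @ down) around a
    frozen pretrained linear layer, so each task 'expert' adds few parameters."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False        # keep pretrained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)     # adapter starts as the identity
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

# Hypothetical usage: wrap, e.g., a query projection inside an attention block.
# attn.q_proj = ExpertAdapter(attn.q_proj, rank=8)
```

Because the up-projection is zero-initialized, the wrapped layer initially behaves exactly like the frozen base, and only the small adapter weights are updated per driving task.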