🤖 AI Summary
Pure vision-based end-to-end autonomous driving systems often suffer from limited reasoning capabilities in complex, dynamic scenarios. Method: To address this, we propose the first knowledge-augmented vision-language model (VLM) framework for end-to-end driving, operating solely on monocular camera input. Our approach integrates VLMs’ cross-modal understanding with end-to-end learning, explicitly modeling driving intentions and scene semantics via natural language instructions. Contribution/Results: Evaluated on the CVPR 2024 End-to-End Driving Challenge, our method achieves first place in the pure-vision track (second overall), significantly outperforming prior vision-only approaches. This demonstrates VLMs’ strong generalization capability and potential for interpretable, semantically grounded decision-making in autonomous driving—without requiring multi-sensor fusion or explicit modular pipelines.
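The paper does not include code here, but the method description (a VLM fusing a single camera frame with a natural-language driving instruction to produce driving outputs) can be illustrated with a minimal sketch. The layer choices, dimensions, instruction vocabulary, and waypoint count below are illustrative placeholders, not the authors' architecture; a real system would use a pretrained VLM backbone rather than these stand-in encoders.

```python
# Hypothetical sketch (not the authors' released code): a monocular image and a
# natural-language driving instruction are fused and decoded into future waypoints.
import torch
import torch.nn as nn


class VLMDrivingSketch(nn.Module):
    """Toy vision-language driving head: image features + instruction tokens -> waypoints."""

    def __init__(self, vocab_size: int = 1000, embed_dim: int = 256, num_waypoints: int = 6):
        super().__init__()
        # Stand-in image encoder (a real system would reuse a pretrained VLM vision tower).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Stand-in text encoder for the driving intention / instruction.
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.text_encoder = nn.GRU(embed_dim, embed_dim, batch_first=True)
        # Fusion and waypoint regression head: (x, y) per future step.
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, num_waypoints * 2),
        )
        self.num_waypoints = num_waypoints

    def forward(self, image: torch.Tensor, instruction_ids: torch.Tensor) -> torch.Tensor:
        img_feat = self.image_encoder(image)                        # (B, D)
        _, txt_state = self.text_encoder(self.token_embed(instruction_ids))
        txt_feat = txt_state[-1]                                    # (B, D)
        fused = torch.cat([img_feat, txt_feat], dim=-1)             # (B, 2D)
        return self.head(fused).view(-1, self.num_waypoints, 2)     # (B, T, 2)


if __name__ == "__main__":
    model = VLMDrivingSketch()
    image = torch.randn(1, 3, 224, 224)            # single front-camera frame
    instruction = torch.randint(0, 1000, (1, 8))   # tokenized instruction, e.g. "turn left ahead"
    waypoints = model(image, instruction)
    print(waypoints.shape)                          # torch.Size([1, 6, 2])
```

The point of the sketch is only the interface: one camera image plus a language-encoded intention goes in, and a trajectory-style output comes out end to end, with no explicit perception or planning modules in between.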
📝 Abstract
End-to-end autonomous driving has drawn tremendous attention recently. Many works focus on using modular deep neural networks to construct the end-to-end architecture. However, whether powerful large language models (LLMs), and especially multimodal vision-language models (VLMs), can benefit end-to-end driving tasks remains an open question. In our work, we demonstrate that combining end-to-end architectural design with knowledgeable VLMs yields impressive performance on driving tasks. Notably, our method uses only a single camera and is the best camera-only solution on the leaderboard, demonstrating the effectiveness of vision-based driving and its potential for end-to-end driving tasks.