🤖 AI Summary
Pure vision-based end-to-end autonomous driving systems often suffer from limited reasoning capabilities in complex, dynamic scenarios. Method: To address this, we propose the first knowledge-augmented vision-language model (VLM) framework for end-to-end driving, operating solely on monocular camera input. Our approach integrates VLMs’ cross-modal understanding with end-to-end learning, explicitly modeling driving intentions and scene semantics via natural language instructions. Contribution/Results: Evaluated on the CVPR 2024 End-to-End Driving Challenge, our method achieves first place in the pure-vision track (second overall), significantly outperforming prior vision-only approaches. This demonstrates VLMs’ strong generalization capability and potential for interpretable, semantically grounded decision-making in autonomous driving—without requiring multi-sensor fusion or explicit modular pipelines.
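The paper does not include code here, but the method description (a VLM fusing a single camera frame with a natural-language driving instruction to produce driving outputs) can be illustrated with a minimal sketch. The layer choices, dimensions, instruction vocabulary, and waypoint count below are illustrative placeholders, not the authors' architecture; a real system would use a pretrained VLM backbone rather than these stand-in encoders.

```python
# Hypothetical sketch (not the authors' released code): a monocular image and a
# natural-language driving instruction are fused and decoded into future waypoints.
import torch
import torch.nn as nn


class VLMDrivingSketch(nn.Module):
    """Toy vision-language driving head: image features + instruction tokens -> waypoints."""

    def __init__(self, vocab_size: int = 1000, embed_dim: int = 256, num_waypoints: int = 6):
        super().__init__()
        # Stand-in image encoder (a real system would reuse a pretrained VLM vision tower).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Stand-in text encoder for the driving intention / instruction.
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.text_encoder = nn.GRU(embed_dim, embed_dim, batch_first=True)
        # Fusion and waypoint regression head: (x, y) per future step.
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, num_waypoints * 2),
        )
        self.num_waypoints = num_waypoints

    def forward(self, image: torch.Tensor, instruction_ids: torch.Tensor) -> torch.Tensor:
        img_feat = self.image_encoder(image)                        # (B, D)
        _, txt_state = self.text_encoder(self.token_embed(instruction_ids))
        txt_feat = txt_state[-1]                                    # (B, D)
        fused = torch.cat([img_feat, txt_feat], dim=-1)             # (B, 2D)
        return self.head(fused).view(-1, self.num_waypoints, 2)     # (B, T, 2)


if __name__ == "__main__":
    model = VLMDrivingSketch()
    image = torch.randn(1, 3, 224, 224)            # single front-camera frame
    instruction = torch.randint(0, 1000, (1, 8))   # tokenized instruction, e.g. "turn left ahead"
    waypoints = model(image, instruction)
    print(waypoints.shape)                          # torch.Size([1, 6, 2])
```

The point of the sketch is only the interface: one camera image plus a language-encoded intention goes in, and a trajectory-style output comes out end to end, with no explicit perception or planning modules in between.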
📝 Abstract
End-to-end autonomous driving has drawn tremendous attention recently. Many works focus on using modular deep neural networks to construct the end-to-end architecture. However, whether powerful large language models (LLMs), and especially multimodal vision-language models (VLMs), can benefit end-to-end driving tasks remains an open question. In our work, we demonstrate that combining end-to-end architectural design with knowledgeable VLMs yields impressive performance on driving tasks. Notably, our method uses only a single camera and is the best camera-only solution on the leaderboard, demonstrating the effectiveness of vision-based driving and its potential for end-to-end driving tasks.