2nd Place Solution for CVPR2024 E2E Challenge: End-to-End Autonomous Driving Using Vision Language Model

📅 2025-09-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Pure vision-based end-to-end autonomous driving systems often suffer from limited reasoning capabilities in complex, dynamic scenarios. Method: To address this, we propose the first knowledge-augmented vision-language model (VLM) framework for end-to-end driving, operating solely on monocular camera input. Our approach integrates VLMs’ cross-modal understanding with end-to-end learning, explicitly modeling driving intentions and scene semantics via natural language instructions. Contribution/Results: Evaluated on the CVPR 2024 End-to-End Driving Challenge, our method achieves first place in the pure-vision track (second overall), significantly outperforming prior vision-only approaches. This demonstrates VLMs’ strong generalization capability and potential for interpretable, semantically grounded decision-making in autonomous driving—without requiring multi-sensor fusion or explicit modular pipelines.

📝 Abstract
End-to-end autonomous driving has drawn tremendous attention recently. Many works focus on using modular deep neural networks to construct the end-to-end architecture. However, whether powerful large language models (LLMs), especially multi-modal Vision Language Models (VLMs), could benefit end-to-end driving tasks remains an open question. In our work, we demonstrate that combining end-to-end architectural design with knowledgeable VLMs yields impressive performance on driving tasks. It is worth noting that our method uses only a single camera and is the best camera-only solution on the leaderboard, demonstrating the effectiveness of the vision-based driving approach and its potential for end-to-end driving tasks.
Problem

Research questions and friction points this paper is trying to address.

Combining vision language models with end-to-end autonomous driving architecture
Investigating VLMs' potential for camera-only driving task performance
Developing single-camera solution for end-to-end autonomous driving
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining end-to-end design with Vision Language Models
Using single camera for best camera-only solution
Leveraging multi-modality VLMs for driving tasks
Zilong Guo
ZERON, Shanghai, China
Yi Luo
ZERON, Shanghai, China
Long Sha
ZERON, Shanghai, China
Dongxu Wang
ZERON, Shanghai, China
Panqu Wang
ZERON 零一汽车
Chenyang Xu
ZERON, Shanghai, China
Yi Yang
ZERON, Shanghai, China