🤖 AI Summary
To address two challenges in vehicle-infrastructure cooperative (V2X) systems, namely fusing heterogeneous perception inputs (onboard vision, roadside multimodal data, and natural language instructions) and insufficiently robust trajectory planning, this paper proposes the first end-to-end V2X autonomous driving framework based on large vision-language models (VLMs). Methodologically, it unifies environmental understanding and trajectory generation in a single architecture, incorporates contrastive learning to make cross-modal representations more robust, and designs a multi-source heterogeneous data fusion module that can be trained end to end. The core contribution is the systematic integration of VLMs into V2X cooperative driving, which couples perception and decision-making at the semantic level. Evaluated on the DAIR-V2X benchmark, the approach outperforms state-of-the-art cooperative driving methods. Additional corner-case analysis indicates that it generalizes well and remains robust under real-world road conditions.
📝 Abstract
Advancements in autonomous driving have increasingly focused on end-to-end (E2E) systems that manage the full spectrum of driving tasks, from environmental perception to vehicle navigation and control. This paper introduces V2X-VLM, an innovative E2E vehicle-infrastructure cooperative autonomous driving (VICAD) framework that combines Vehicle-to-Everything (V2X) systems with large vision-language models (VLMs). V2X-VLM is designed to enhance situational awareness, decision-making, and, ultimately, trajectory planning by integrating multimodal data from vehicle-mounted cameras, infrastructure sensors, and textual information. A contrastive learning method further complements the VLM by refining feature discrimination, helping the model learn robust representations of the driving environment. Evaluations on the DAIR-V2X dataset show that V2X-VLM outperforms state-of-the-art cooperative autonomous driving methods, while additional tests on corner cases validate its robustness in real-world driving conditions.
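The abstract does not specify the form of the contrastive objective. As a rough illustration of how contrastive learning can sharpen feature discrimination between paired visual and textual inputs, the sketch below implements a generic symmetric InfoNCE (CLIP-style) loss in NumPy. The function name `info_nce_loss`, the `temperature` value, and the assumption that row i of each embedding matrix describes the same driving scene are all hypothetical choices for this example, not details taken from V2X-VLM.

```python
import numpy as np

def info_nce_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Assumption for this sketch: row i of image_emb and row i of text_emb
    come from the same scene (positive pair); all other rows in the batch
    act as negatives.
    """
    # L2-normalise so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.T / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i.
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))
```

Minimizing a loss of this kind pulls matched image and text embeddings together and pushes mismatched pairs apart, which is one common way to obtain the more discriminative features the abstract describes.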