VDT-Auto: End-to-end Autonomous Driving with VLM-Guided Diffusion Transformers

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the insufficient decision-making robustness of end-to-end autonomous driving in dynamic environments and long-tail scenarios, this paper proposes a geometry–semantics dual-conditioned diffusion Transformer for action generation. A vision-language model (VLM) is integrated to guide the diffusion process: its structured outputs steer the forward noising schedule, while bird's-eye-view (BEV) features and text embeddings jointly constrain the reverse denoising, enabling unified multimodal state-to-action modeling. The approach encompasses VLM fine-tuning, a dedicated BEV encoder, a diffusion Transformer architecture, and a multimodal conditional generation mechanism. On the nuScenes open-loop planning benchmark, it achieves an average L2 error of 0.52 m and an average collision rate of 21%. Real-world vehicle validation further demonstrates strong generalization and practical deployment potential.
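As a concrete illustration of the VLM-steered forward process described above, here is a minimal PyTorch sketch. The function name, tensor shapes, cosine schedule, and the residual-plus-jitter noise model are all assumptions made for illustration; the paper's actual schedule and parameterization may differ.

```python
# Minimal sketch (assumed shapes/schedule): the forward-process "noise" is
# sampled around the residual between a VLM-proposed noisy path and the
# ground-truth path, instead of being purely Gaussian.
import torch

def forward_noising(gt_path, vlm_path, t, T=1000, jitter=0.05):
    """gt_path, vlm_path: (B, N, 2) waypoints; t: (B,) integer timesteps in [0, T)."""
    # DDPM-style signal coefficient; a cosine schedule is assumed here.
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / T).pow(2).view(-1, 1, 1)
    # Noise centered on the VLM/ground-truth residual plus small Gaussian jitter
    # (one plausible reading of "noise sampled from the VLM's noisy paths").
    eps = (vlm_path - gt_path) + jitter * torch.randn_like(gt_path)
    # Standard forward-process blend of clean signal and noise.
    x_t = alpha_bar.sqrt() * gt_path + (1.0 - alpha_bar).sqrt() * eps
    return x_t, eps  # eps is the regression target for the denoiser
```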

📝 Abstract
In autonomous driving, dynamic environments and corner cases pose significant challenges to the robustness of the ego vehicle's decision-making. To address these challenges, starting from the representation of the state-action mapping in the end-to-end autonomous driving paradigm, we introduce a novel pipeline, VDT-Auto. Leveraging the state-understanding capability of a Visual Language Model (VLM) together with diffusion Transformer-based action generation, VDT-Auto parses the environment both geometrically and contextually to condition the diffusion process. Geometrically, a bird's-eye-view (BEV) encoder extracts feature grids from the surrounding images. Contextually, the structured output of our fine-tuned VLM is processed into textual embeddings and noisy paths. During the diffusion process, the noise added in the forward process is sampled from the noisy path output of the fine-tuned VLM, while the extracted BEV feature grids and embedded texts condition the reverse process of our diffusion Transformers. VDT-Auto achieves an average L2 error of 0.52 m and an average collision rate of 21% in the nuScenes open-loop planning evaluation. Moreover, a real-world demonstration exhibits the strong generalizability of VDT-Auto. The code and dataset will be released after acceptance.
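To make the dual conditioning of the reverse process concrete, the sketch below shows one denoising block in which noisy path tokens cross-attend to concatenated BEV and text tokens. Layer sizes, token counts, and the fusion-by-concatenation choice are illustrative assumptions, not the published architecture.

```python
# Hedged sketch of a conditional denoising block: path tokens self-attend,
# then cross-attend to BEV feature-grid tokens and VLM text embeddings.
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.waypoint_in = nn.Linear(2, d_model)     # embed (x, y) waypoints
        self.time_emb = nn.Embedding(1000, d_model)  # diffusion timestep embedding
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))
        self.out = nn.Linear(d_model, 2)             # predict per-waypoint noise

    def forward(self, noisy_path, t, bev_tokens, text_tokens):
        # noisy_path: (B, N, 2); bev_tokens: (B, HW, d); text_tokens: (B, L, d)
        h = self.waypoint_in(noisy_path) + self.time_emb(t).unsqueeze(1)
        h = h + self.self_attn(self.norm1(h), self.norm1(h), self.norm1(h))[0]
        cond = torch.cat([bev_tokens, text_tokens], dim=1)  # joint condition stream
        h = h + self.cross_attn(self.norm2(h), cond, cond)[0]
        h = h + self.ff(self.norm3(h))
        return self.out(h)  # estimated noise on the path

# Usage on dummy tensors (hypothetical sizes):
model = ConditionalDenoiser()
eps_hat = model(torch.randn(2, 6, 2), torch.randint(0, 1000, (2,)),
                torch.randn(2, 200, 256), torch.randn(2, 32, 256))
```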
Problem

Research questions and friction points this paper is trying to address.

Insufficient decision-making robustness of end-to-end autonomous driving
Dynamic environments and corner cases that challenge ego-vehicle planning
How to integrate a Visual Language Model with diffusion Transformers for action generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLM-guided diffusion Transformers
BEV encoder for geometric feature extraction
Textual embeddings and BEV features condition the reverse diffusion
Ziang Guo
Intelligent Space Robotics Laboratory, Center for Digital Engineering, Skolkovo Institute of Science and Technology, Moscow, Russia
Konstantin Gubernatorov
Skolkovo Institute of Science and Technology
Robotics, VLA, SLAM, Warehouse Automation, Multi-Agent Task Allocation
Selamawit Asfaw
Intelligent Space Robotics Laboratory, Center for Digital Engineering, Skolkovo Institute of Science and Technology, Moscow, Russia
Zakhar Yagudin
Student, Skoltech
Self-driving, Autonomous cars, Computer Vision, Control Theory
Dzmitry Tsetserukou
Associate Professor, Skolkovo Institute of Science and Technology (Skoltech)
Robotics, Haptics, UAV Swarm, AI, VR