VDT-Auto: End-to-end Autonomous Driving with VLM-Guided Diffusion Transformers

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the insufficient decision-making robustness of end-to-end autonomous driving in dynamic environments and long-tail scenarios, this paper proposes a geometry–semantics dual-conditioned diffusion Transformer for action generation. A vision-language model (VLM) is integrated to guide the diffusion process: its structured outputs steer the forward noising schedule, while bird's-eye-view (BEV) features and text embeddings jointly constrain the reverse denoising, enabling unified multimodal state-to-action modeling. The approach encompasses VLM fine-tuning, a dedicated BEV encoder, a diffusion Transformer architecture, and a multimodal conditional generation mechanism. On the nuScenes open-loop planning benchmark, it achieves an average L2 error of 0.52 m and an average collision rate of 21%. Real-world vehicle validation further demonstrates strong generalization and practical deployment potential.
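As a concrete illustration of the VLM-steered forward process described above, here is a minimal PyTorch sketch. The function name, tensor shapes, cosine schedule, and the residual-plus-jitter noise model are all assumptions made for illustration; the paper's actual schedule and parameterization may differ.

```python
# Minimal sketch (assumed shapes/schedule): the forward-process "noise" is
# sampled around the residual between a VLM-proposed noisy path and the
# ground-truth path, instead of being purely Gaussian.
import torch

def forward_noising(gt_path, vlm_path, t, T=1000, jitter=0.05):
    """gt_path, vlm_path: (B, N, 2) waypoints; t: (B,) integer timesteps in [0, T)."""
    # DDPM-style signal coefficient; a cosine schedule is assumed here.
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / T).pow(2).view(-1, 1, 1)
    # Noise centered on the VLM/ground-truth residual plus small Gaussian jitter
    # (one plausible reading of "noise sampled from the VLM's noisy paths").
    eps = (vlm_path - gt_path) + jitter * torch.randn_like(gt_path)
    # Standard forward-process blend of clean signal and noise.
    x_t = alpha_bar.sqrt() * gt_path + (1.0 - alpha_bar).sqrt() * eps
    return x_t, eps  # eps is the regression target for the denoiser
```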

📝 Abstract
In autonomous driving, dynamic environments and corner cases pose significant challenges to the robustness of the ego vehicle's decision-making. To address these challenges, starting from the representation of the state-action mapping in the end-to-end autonomous driving paradigm, we introduce a novel pipeline, VDT-Auto. Leveraging the state-understanding capability of a Visual Language Model (VLM) together with diffusion Transformer-based action generation, VDT-Auto parses the environment both geometrically and contextually to condition the diffusion process. Geometrically, a bird's-eye-view (BEV) encoder extracts feature grids from the surrounding images. Contextually, the structured output of our fine-tuned VLM is processed into textual embeddings and noisy paths. During the diffusion process, the noise added in the forward process is sampled from the noisy path output of the fine-tuned VLM, while the extracted BEV feature grids and embedded texts condition the reverse process of our diffusion Transformers. VDT-Auto achieves an average L2 error of 0.52 m and an average collision rate of 21% in the nuScenes open-loop planning evaluation. Moreover, a real-world demonstration exhibits the strong generalizability of VDT-Auto. The code and dataset will be released after acceptance.
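To make the dual conditioning of the reverse process concrete, the sketch below shows one denoising block in which noisy path tokens cross-attend to concatenated BEV and text tokens. Layer sizes, token counts, and the fusion-by-concatenation choice are illustrative assumptions, not the published architecture.

```python
# Hedged sketch of a conditional denoising block: path tokens self-attend,
# then cross-attend to BEV feature-grid tokens and VLM text embeddings.
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.waypoint_in = nn.Linear(2, d_model)     # embed (x, y) waypoints
        self.time_emb = nn.Embedding(1000, d_model)  # diffusion timestep embedding
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))
        self.out = nn.Linear(d_model, 2)             # predict per-waypoint noise

    def forward(self, noisy_path, t, bev_tokens, text_tokens):
        # noisy_path: (B, N, 2); bev_tokens: (B, HW, d); text_tokens: (B, L, d)
        h = self.waypoint_in(noisy_path) + self.time_emb(t).unsqueeze(1)
        h = h + self.self_attn(self.norm1(h), self.norm1(h), self.norm1(h))[0]
        cond = torch.cat([bev_tokens, text_tokens], dim=1)  # joint condition stream
        h = h + self.cross_attn(self.norm2(h), cond, cond)[0]
        h = h + self.ff(self.norm3(h))
        return self.out(h)  # estimated noise on the path

# Usage on dummy tensors (hypothetical sizes):
model = ConditionalDenoiser()
eps_hat = model(torch.randn(2, 6, 2), torch.randint(0, 1000, (2,)),
                torch.randn(2, 200, 256), torch.randn(2, 32, 256))
```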
Problem

Research questions and friction points this paper is trying to address.

Insufficient decision-making robustness of end-to-end autonomous driving
Dynamic environments and corner cases that challenge ego-vehicle planning
How to integrate a Visual Language Model with diffusion Transformers for action generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLM-guided diffusion Transformers
BEV encoder for geometric feature extraction
Textual embeddings and BEV features condition the reverse diffusion
Ziang Guo
Intelligent Space Robotics Laboratory, Center for Digital Engineering, Skolkovo Institute of Science and Technology, Moscow, Russia
Konstantin Gubernatorov
Skolkovo Institute of Science and Technology
Robotics, VLA, SLAM, Warehouse Automation, Multi-Agent Task Allocation
Selamawit Asfaw
Intelligent Space Robotics Laboratory, Center for Digital Engineering, Skolkovo Institute of Science and Technology, Moscow, Russia
Zakhar Yagudin
Student, Skoltech
Self-driving, Autonomous cars, Computer Vision, Control Theory
Dzmitry Tsetserukou
Associate Professor, Skolkovo Institute of Science and Technology (Skoltech)
Robotics, Haptics, UAV Swarm, AI, VR