🤖 AI Summary
To improve the robustness of end-to-end autonomous driving decisions in dynamic environments and long-tail scenarios, this paper proposes a diffusion Transformer for action generation that is conditioned on both geometry and semantics. A fine-tuned vision-language model (VLM) guides the diffusion process: its structured output is converted into text embeddings and a noisy path, and the noise added in the forward process is sampled from that noisy path, while bird's-eye-view (BEV) features and the text embeddings jointly condition reverse denoising. The approach comprises VLM fine-tuning, a BEV encoder, a diffusion Transformer architecture, and a multimodal conditional generation mechanism. In the nuScenes open-loop planning evaluation, it achieves an average L2 error of 0.52 m and an average collision rate of 21%. A real-world demonstration further indicates good generalization and suggests potential for practical deployment.
📝 Abstract
In autonomous driving, dynamic environments and corner cases pose significant challenges to the robustness of the ego vehicle's decision-making. To address these challenges, starting from how the state-action mapping is represented in the end-to-end autonomous driving paradigm, we introduce a novel pipeline, VDT-Auto. VDT-Auto combines the state understanding of a Visual Language Model (VLM) with diffusion Transformer-based action generation, parsing the environment both geometrically and contextually to condition the diffusion process. Geometrically, a bird's-eye view (BEV) encoder extracts feature grids from the surrounding images. Contextually, the structured output of our fine-tuned VLM is processed into textual embeddings and noisy paths. In the diffusion process, the noise added in the forward process is sampled from the noisy path output of the fine-tuned VLM, while the extracted BEV feature grids and embedded texts condition the reverse process of our diffusion Transformers. VDT-Auto achieved an average L2 error of 0.52 m and an average collision rate of 21% in the nuScenes open-loop planning evaluation. Moreover, a real-world demonstration showed that VDT-Auto generalizes well. The code and dataset will be released after acceptance.
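The conditioning scheme described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and every name in it (make_alpha_bar, vlm_residual_noise, forward_noise, build_condition) is hypothetical. It assumes a standard DDPM-style forward step with a linear beta schedule, and it assumes one possible reading of "noise sampled from the noisy path": the offset between the VLM's noisy path and the ground-truth path. The conditioning for the reverse process is assumed to be a simple concatenation of flattened BEV tokens and text embedding tokens of the same channel width.

```python
import numpy as np


def make_alpha_bar(num_steps: int, beta_start: float = 1e-4, beta_end: float = 2e-2) -> np.ndarray:
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def vlm_residual_noise(gt_path: np.ndarray, vlm_path: np.ndarray) -> np.ndarray:
    """Hypothetical noise term: offset of the VLM's noisy path from the ground-truth path."""
    return vlm_path - gt_path


def forward_noise(x0: np.ndarray, noise: np.ndarray, alpha_bar: np.ndarray, t: int) -> np.ndarray:
    """Standard DDPM forward step: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise


def build_condition(bev_grid: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Hypothetical conditioning: flatten a (C, H, W) BEV grid into (H*W, C) tokens
    and append (N_text, C) text tokens along the token axis."""
    c, h, w = bev_grid.shape
    bev_tokens = bev_grid.reshape(c, h * w).T
    return np.concatenate([bev_tokens, text_emb], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt_path = rng.normal(size=(6, 2))                          # 6 waypoints, (x, y); made-up data
    vlm_path = gt_path + rng.normal(scale=0.5, size=(6, 2))    # stand-in for the VLM's noisy path
    alpha_bar = make_alpha_bar(num_steps=100)

    noise = vlm_residual_noise(gt_path, vlm_path)
    x_t = forward_noise(gt_path, noise, alpha_bar, t=50)

    cond = build_condition(rng.normal(size=(8, 4, 4)), rng.normal(size=(3, 8)))
    print(x_t.shape, cond.shape)
```

In a full pipeline, a denoising network would take x_t, the timestep, and the condition tokens and be trained to recover the noise or the clean path; that network is omitted here because the abstract does not specify its architecture.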