OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model

📅 2025-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of tightly coupling multimodal perception, instruction understanding, and action generation in end-to-end autonomous driving. Methodologically: (1) it introduces a hierarchical vision-language alignment mechanism to unify 2D/3D visual representations with natural language semantics; (2) it proposes an autoregressive agent-env-ego interaction model that jointly performs spatial perception and behavioral decision-making for trajectory planning; and (3) it builds upon open-source large vision-language models, integrating multimodal projection, structured visual token encoding, and 3D perception fusion modules. Evaluated on nuScenes, the framework achieves state-of-the-art performance on both open-loop trajectory planning and driving-related question-answering tasks. It significantly improves adherence to high-level semantic instructions and enhances trajectory robustness, particularly in complex, dynamic driving scenarios.

📝 Abstract
We present OpenDriveVLA, a Vision-Language Action (VLA) model designed for end-to-end autonomous driving. OpenDriveVLA builds upon open-source pre-trained large Vision-Language Models (VLMs) to generate reliable driving actions, conditioned on 3D environmental perception, ego vehicle states, and driver commands. To bridge the modality gap between driving visual representations and language embeddings, we propose a hierarchical vision-language alignment process, projecting both 2D and 3D structured visual tokens into a unified semantic space. In addition, OpenDriveVLA models the dynamic relationships between the ego vehicle, surrounding agents, and static road elements through an autoregressive agent-env-ego interaction process, ensuring both spatially and behaviorally informed trajectory planning. Extensive experiments on the nuScenes dataset demonstrate that OpenDriveVLA achieves state-of-the-art results across open-loop trajectory planning and driving-related question-answering tasks. Qualitative analyses further illustrate OpenDriveVLA's superior capability to follow high-level driving commands and robustly generate trajectories under challenging scenarios, highlighting its potential for next-generation end-to-end autonomous driving. We will release our code to facilitate further research in this domain.
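The hierarchical alignment the abstract describes can be pictured as per-modality projectors mapping 2D image tokens and 3D perception tokens into the language model's embedding space, where they can be interleaved with text tokens. The sketch below (pure Python, no deep-learning framework) illustrates that idea only; all names and dimensions (`D2`, `D3`, `D_LM`, `project`) are illustrative assumptions, not the paper's actual architecture or API.

```python
import random

random.seed(0)

D2, D3, D_LM = 4, 6, 8  # toy dims: 2D tokens, 3D tokens, language embeddings

def make_linear(d_in, d_out):
    """Return a toy linear layer (a d_out x d_in weight matrix)."""
    return [[random.uniform(-0.1, 0.1) for _ in range(d_in)]
            for _ in range(d_out)]

def project(tokens, weight):
    """Apply y = W x to each token vector (the modality projector)."""
    return [[sum(w_i * x_i for w_i, x_i in zip(row, tok)) for row in weight]
            for tok in tokens]

# One projector per visual modality, as in a hierarchical alignment stage.
proj_2d = make_linear(D2, D_LM)
proj_3d = make_linear(D3, D_LM)

tokens_2d = [[1.0] * D2 for _ in range(3)]  # e.g. 3 image-patch tokens
tokens_3d = [[1.0] * D3 for _ in range(2)]  # e.g. 2 agent/map tokens

# After projection both modalities live in the same D_LM-dim semantic
# space and can be concatenated with text-token embeddings.
unified = project(tokens_2d, proj_2d) + project(tokens_3d, proj_3d)
print(len(unified), len(unified[0]))  # prints: 5 8
```

The design point the abstract makes is that once both 2D and 3D structured tokens share the language embedding space, the pre-trained VLM can attend over perception and instructions uniformly.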
Problem

Research questions and friction points this paper is trying to address.

Bridging vision-language-action gaps for autonomous driving
Modeling dynamic vehicle-agent-road interactions for trajectory planning
Achieving state-of-the-art performance in driving tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages pre-trained Vision-Language Models for driving
Aligns 2D/3D visual tokens via hierarchical process
Models dynamic agent-env-ego interactions autoregressively
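The autoregressive agent-env-ego interaction listed above amounts to decoding the trajectory one waypoint at a time, each step conditioned on the scene context and the waypoints emitted so far. The toy rollout below shows only that decoding pattern; the `step` function (here, drifting halfway toward a goal point) is a hypothetical stand-in, not the model's learned decoder.

```python
def step(context, history):
    """Hypothetical one-step decoder: move halfway toward the goal."""
    x, y = history[-1]
    gx, gy = context["goal"]
    return (x + 0.5 * (gx - x), y + 0.5 * (gy - y))

def decode_trajectory(context, horizon=4):
    """Autoregressively roll out `horizon` waypoints from the ego pose,
    each conditioned on the context and all previously emitted waypoints."""
    waypoints = [context["ego"]]
    for _ in range(horizon):
        waypoints.append(step(context, waypoints))
    return waypoints[1:]  # planned future waypoints only

traj = decode_trajectory({"ego": (0.0, 0.0), "goal": (8.0, 0.0)})
print(traj)  # prints: [(4.0, 0.0), (6.0, 0.0), (7.0, 0.0), (7.5, 0.0)]
```

In the real model the context would carry agent, map, and ego tokens and `step` would be the VLM's next-token prediction; the structural point is that later waypoints depend on earlier ones, which is what makes the planning both spatially and behaviorally informed.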