CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving

📅 2024-08-19
🏛️ IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
📈 Citations: 30
✨ Influential: 2
🤖 AI Summary
End-to-end path planning for autonomous driving in complex scenarios is hindered by the scarcity of large-scale, high-fidelity vision-language-action (VLA) aligned datasets. Method: The authors introduce the first real-world VLA driving dataset, comprising over 80 hours of footage and built with a scalable, automated end-to-end data generation pipeline. The pipeline synchronizes raw multi-sensor data at millisecond precision and pairs large-model-generated semantic captions with fine-grained action annotations, enabling joint cross-modal modeling. Contribution/Results: The dataset improves on prior work in modality coupling, annotation granularity, and scene diversity, providing an interpretable, evaluable multimodal training benchmark for autonomous driving. Multimodal large language models (MLLMs) trained on it show strong cross-modal alignment, coherent natural-language scene descriptions, and accurate action planning on open-road tasks, with substantial gains in generalization and operational reliability.
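The millisecond-level synchronization described above can be illustrated with a small sketch. The paper does not publish its pipeline code, so everything below (function name, tolerance value, stream layout) is a hypothetical illustration of nearest-timestamp matching between a camera stream and a faster sensor stream such as a CAN bus:

```python
from bisect import bisect_left

def align_streams(frame_ts, sensor_ts, tol_ms=5.0):
    """For each camera frame timestamp (ms), pick the nearest sensor
    timestamp (ms); drop frames with no reading within tol_ms.
    sensor_ts must be sorted ascending."""
    pairs = []
    for t in frame_ts:
        i = bisect_left(sensor_ts, t)
        # Candidates: the readings just before and just at/after t.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(sensor_ts)]
        j = min(candidates, key=lambda k: abs(sensor_ts[k] - t))
        if abs(sensor_ts[j] - t) <= tol_ms:
            pairs.append((t, sensor_ts[j]))
    return pairs

# ~30 fps camera (33.3 ms apart) vs. 100 Hz sensor bus (10 ms apart)
frames = [0.0, 33.3, 66.7, 100.0]
sensors = [float(k * 10) for k in range(11)]  # 0, 10, ..., 100
print(align_streams(frames, sensors))
# → [(0.0, 0.0), (33.3, 30.0), (66.7, 70.0), (100.0, 100.0)]
```

A real pipeline would additionally interpolate sensor values between readings rather than snapping to the nearest one, but nearest-neighbour matching already bounds the alignment error at half the sensor period.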

๐Ÿ“ Abstract
Autonomous driving, particularly navigating complex and unanticipated scenarios, demands sophisticated reasoning and planning capabilities. While Multi-modal Large Language Models (MLLMs) offer a promising avenue for this, their use has been largely confined to understanding complex environmental contexts or generating high-level driving commands, with few studies extending their application to end-to-end path planning. A major research bottleneck is the lack of large-scale annotated datasets encompassing vision, language, and action. To address this issue, we propose CoVLA (Comprehensive Vision-Language-Action) Dataset, an extensive dataset comprising real-world driving videos spanning more than 80 hours. This dataset leverages a novel, scalable approach based on automated data processing and a caption generation pipeline to generate accurate driving trajectories paired with detailed natural language descriptions of driving environments and maneuvers. This approach utilizes raw in-vehicle sensor data, allowing it to surpass existing datasets in scale and annotation richness. Using CoVLA, we investigate the driving capabilities of MLLMs that can handle vision, language, and action in a variety of driving scenarios. Our results illustrate the strong proficiency of our model in generating coherent language and action outputs, emphasizing the potential of Vision-Language-Action (VLA) models in the field of autonomous driving. This dataset establishes a framework for robust, interpretable, and data-driven autonomous driving systems by providing a comprehensive platform for training and evaluating VLA models, contributing to safer and more reliable self-driving vehicles. The dataset is released for academic purposes.
Problem

Research questions and friction points this paper is trying to address.

Addressing the lack of large-scale vision-language-action datasets for autonomous driving.
Extending MLLMs from environmental understanding to end-to-end path planning.
Enabling robust training of models for complex driving scenarios and maneuvers.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated data processing for scalable dataset creation.
Vision-language-action model integrating sensor data and descriptions.
Generating driving trajectories with natural language captions.
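The trajectory-plus-caption pairing listed above suggests a simple per-frame record layout. The actual CoVLA schema is not reproduced here; the field names and values below are a hypothetical sketch of what one vision-language-action training sample might contain:

```python
from dataclasses import dataclass

@dataclass
class VLASample:
    """One hypothetical vision-language-action training sample:
    a camera frame, a scene/maneuver caption, and the future trajectory."""
    frame_path: str    # vision: path to the camera frame
    timestamp_ms: float  # capture time, aligned with sensor streams
    caption: str       # language: generated description of scene and maneuver
    trajectory: list   # action: future (x, y) waypoints in metres, ego frame

sample = VLASample(
    frame_path="frames/000123.jpg",
    timestamp_ms=4100.0,
    caption="Ego vehicle slows behind a braking truck in the right lane.",
    trajectory=[(0.0, 0.0), (1.8, 0.1), (3.4, 0.2)],
)
print(len(sample.trajectory))  # → 3
```

Keeping all three modalities keyed to a single timestamp is what lets an MLLM be supervised jointly on caption generation and waypoint prediction from the same frame.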