🤖 AI Summary
To address insufficient generalization to unseen, unstructured outdoor environments in open-world end-to-end autonomous driving, this paper proposes a vision-language action retrieval framework. Methodologically, it leverages frozen multimodal large models (e.g., CLIP), employs a Q-Former bottleneck for fine-grained vision-language feature alignment and aggregation, and introduces vision-action contrastive learning to establish cross-modal mappings directly from perceptual inputs to executable actions. Because the approach requires no model fine-tuning and no environment-specific priors, it enables zero-shot action retrieval. Evaluated on a real robotic platform, the approach significantly improves navigation robustness and exploratory capability under limited training data and in entirely unseen scenes, while also offering strong interpretability and high deployment efficiency.
📝 Abstract
Exploring open-world situations in an end-to-end manner is a promising yet challenging task due to the need for strong generalization capabilities. In particular, end-to-end autonomous driving in unstructured outdoor environments often encounters conditions that were unfamiliar during training. In this work, we present Vision-Language Action Retrieval (VLA-R), an open-world end-to-end autonomous driving (OW-E2EAD) framework that integrates open-world perception with a novel vision-action retrieval paradigm. We leverage a frozen vision-language model for open-world detection and segmentation to obtain multi-scale, prompt-guided, and interpretable perception features without domain-specific tuning. A Q-Former bottleneck aggregates fine-grained visual representations with language-aligned visual features, bridging perception and action domains. To learn transferable driving behaviors, we introduce a vision-action contrastive learning scheme that aligns vision-language and action embeddings for effective open-world reasoning and action retrieval. Our experiments on a real-world robotic platform demonstrate strong generalization and exploratory performance in unstructured, unseen environments, even with limited data. Demo videos are provided in the supplementary material.
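The core idea of the abstract's vision-action contrastive learning and action retrieval can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the symmetric InfoNCE-style objective, the NumPy formulation, and all function names (`info_nce_loss`, `retrieve_action`) are our assumptions, and the embeddings here stand in for the Q-Former's vision-language features and the action embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce_loss(vision_emb, action_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired (vision, action) embeddings.

    Matched pairs sit on the diagonal of the similarity matrix; the loss pulls
    them together and pushes mismatched pairs apart (InfoNCE-style sketch).
    """
    v = l2_normalize(np.asarray(vision_emb, dtype=float))
    a = l2_normalize(np.asarray(action_emb, dtype=float))
    logits = (v @ a.T) / temperature
    labels = np.arange(len(v))

    def cross_entropy(lg, lb):
        lg = lg - lg.max(axis=1, keepdims=True)          # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Average the vision->action and action->vision directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

def retrieve_action(query_vision_emb, action_bank_emb):
    """Zero-shot action retrieval: return the index of the nearest action embedding
    to a query vision embedding, by cosine similarity."""
    q = l2_normalize(np.asarray(query_vision_emb, dtype=float))
    bank = l2_normalize(np.asarray(action_bank_emb, dtype=float))
    return int(np.argmax(bank @ q))
```

At deployment time, no gradient step is needed: a new scene's perception features are embedded once and matched against the bank of action embeddings, which is what makes the retrieval zero-shot in spirit.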