🤖 AI Summary
To address insufficient generalization to unseen, unstructured outdoor environments in open-world end-to-end autonomous driving, this paper proposes a vision-language action retrieval framework. Methodologically, it leverages frozen multimodal large models (e.g., CLIP), employs a Q-Former bottleneck for fine-grained vision-language feature alignment and aggregation, and introduces vision-action contrastive learning to establish cross-modal mappings directly from perceptual inputs to executable actions. Because the approach requires no model fine-tuning and no environment-specific priors, it enables zero-shot action retrieval. Evaluated on a real robotic platform, the approach significantly improves navigation robustness and exploratory capability under limited training data and in entirely unseen scenes, while also offering strong interpretability and high deployment efficiency.
📝 Abstract
Exploring open-world situations in an end-to-end manner is a promising yet challenging task due to the need for strong generalization capabilities. In particular, end-to-end autonomous driving in unstructured outdoor environments often encounters conditions that were unfamiliar during training. In this work, we present Vision-Language Action Retrieval (VLA-R), an open-world end-to-end autonomous driving (OW-E2EAD) framework that integrates open-world perception with a novel vision-action retrieval paradigm. We leverage a frozen vision-language model for open-world detection and segmentation to obtain multi-scale, prompt-guided, and interpretable perception features without domain-specific tuning. A Q-Former bottleneck aggregates fine-grained visual representations with language-aligned visual features, bridging perception and action domains. To learn transferable driving behaviors, we introduce a vision-action contrastive learning scheme that aligns vision-language and action embeddings for effective open-world reasoning and action retrieval. Our experiments on a real-world robotic platform demonstrate strong generalization and exploratory performance in unstructured, unseen environments, even with limited data. Demo videos are provided in the supplementary material.
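The core idea of the abstract's vision-action contrastive learning and action retrieval can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the symmetric InfoNCE-style objective, the NumPy formulation, and all function names (`info_nce_loss`, `retrieve_action`) are our assumptions, and the embeddings here stand in for the Q-Former's vision-language features and the action embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce_loss(vision_emb, action_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired (vision, action) embeddings.

    Matched pairs sit on the diagonal of the similarity matrix; the loss pulls
    them together and pushes mismatched pairs apart (InfoNCE-style sketch).
    """
    v = l2_normalize(np.asarray(vision_emb, dtype=float))
    a = l2_normalize(np.asarray(action_emb, dtype=float))
    logits = (v @ a.T) / temperature
    labels = np.arange(len(v))

    def cross_entropy(lg, lb):
        lg = lg - lg.max(axis=1, keepdims=True)          # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Average the vision->action and action->vision directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

def retrieve_action(query_vision_emb, action_bank_emb):
    """Zero-shot action retrieval: return the index of the nearest action embedding
    to a query vision embedding, by cosine similarity."""
    q = l2_normalize(np.asarray(query_vision_emb, dtype=float))
    bank = l2_normalize(np.asarray(action_bank_emb, dtype=float))
    return int(np.argmax(bank @ q))
```

At deployment time, no gradient step is needed: a new scene's perception features are embedded once and matched against the bank of action embeddings, which is what makes the retrieval zero-shot in spirit.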