AI Summary
This work addresses the limited zero-shot and cross-environment generalization of visual navigation models. To this end, we propose PIG-Nav, a framework built on a pretrained ViT image encoder that features an early-fusion architecture for jointly encoding egocentric observations and goal images, and that incorporates multi-task self-supervised objectives to strengthen cross-environment representation learning. We further introduce a lightweight, automated game-video annotation pipeline to enable large-scale navigation pretraining. Experiments across two simulation platforms and one real-world robot environment show that PIG-Nav improves zero-shot navigation performance by 22.6% on average and, with only minimal fine-tuning data, attains state-of-the-art results that outperform prior methods by 37.5%. The framework substantially improves model transferability and practical applicability for real-world deployment.
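To make the early-fusion idea concrete, below is a minimal PyTorch sketch of a joint observation/goal encoder: patch tokens from both images are concatenated into a single sequence so that every transformer layer can attend across the two views. All names, dimensions, and the token-type embeddings are illustrative assumptions, not PIG-Nav's actual implementation.

```python
import torch
import torch.nn as nn

class EarlyFusionImageGoalEncoder(nn.Module):
    """Hypothetical sketch of an early-fusion observation/goal encoder.

    Layer sizes, token layout, and (absent here) pretrained weights are
    assumptions for illustration; they do not reproduce PIG-Nav.
    """

    def __init__(self, img_size=224, patch_size=16, dim=768, depth=4, heads=12):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Shared patch embedding applied to both the observation and the goal.
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # Learned type embeddings distinguish observation tokens from goal tokens.
        self.obs_type = nn.Parameter(torch.zeros(1, 1, dim))
        self.goal_type = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, depth)

    def tokens(self, img):
        # (B, 3, H, W) -> (B, N, dim) patch tokens with positional embeddings.
        t = self.patchify(img).flatten(2).transpose(1, 2)
        return t + self.pos

    def forward(self, obs, goal):
        # Early fusion: concatenate both token sets into one sequence so every
        # transformer layer attends jointly across observation and goal.
        fused = torch.cat(
            [self.tokens(obs) + self.obs_type, self.tokens(goal) + self.goal_type],
            dim=1,
        )
        return self.transformer(fused)  # (B, 2N, dim) joint representation

if __name__ == "__main__":
    enc = EarlyFusionImageGoalEncoder()
    obs = torch.randn(2, 3, 224, 224)
    goal = torch.randn(2, 3, 224, 224)
    print(enc(obs, goal).shape)  # torch.Size([2, 392, 768])
```

By contrast, a late-fusion design would encode each image separately and merge only the pooled features, which prevents the encoder layers from attending across the two images.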
Abstract
Recent studies have explored pretrained (foundation) models for vision-based robotic navigation, aiming to achieve generalizable navigation and positive transfer across diverse environments while enhancing zero-shot performance in unseen settings. In this work, we introduce PIG-Nav (Pretrained Image-Goal Navigation), a new approach that further investigates pretraining strategies for vision-based navigation models and contributes in two key areas. Model-wise, we identify two critical design choices that consistently improve the performance of pretrained navigation models: (1) integrating an early-fusion network structure that combines visual observations and goal images via an appropriately pretrained Vision Transformer (ViT) image encoder, and (2) introducing suitable auxiliary tasks that enhance global navigation representation learning and thereby improve navigation performance. Dataset-wise, we propose a novel data preprocessing pipeline for efficiently labeling large-scale game video datasets for navigation model training, and we demonstrate that augmenting existing open navigation datasets with diverse gameplay videos improves model performance. Our model achieves an average improvement of 22.6% in zero-shot settings and a 37.5% improvement in fine-tuning settings over existing visual navigation foundation models in two complex simulated environments and one real-world environment. These results advance the state of the art in pretrained image-goal navigation models. Notably, our model maintains competitive performance while requiring significantly less fine-tuning data, highlighting its potential for real-world deployment with minimal labeled supervision.
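As a rough illustration of what a game-video labeling pipeline might look like, the sketch below pairs an earlier video frame (the observation) with a later frame (the goal) and records the temporal offset as a weak distance-to-goal label. The function name, sampling parameters, and label choice are assumptions made for illustration; the paper's actual annotation procedure is not reproduced here.

```python
import random
import cv2  # OpenCV for video decoding

def sample_image_goal_pairs(video_path, num_pairs=100, min_gap=8, max_gap=64, stride=2):
    """Hypothetical sketch: turn a raw gameplay video into navigation
    training tuples (observation frame, goal frame, temporal offset).
    Gap ranges and the offset-as-label heuristic are assumptions.
    """
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    idx = 0
    while ok:
        if idx % stride == 0:  # subsample to reduce near-duplicate frames
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        ok, frame = cap.read()
        idx += 1
    cap.release()

    pairs = []
    for _ in range(num_pairs):
        if len(frames) <= max_gap:
            break  # video too short to sample the requested gap
        t = random.randrange(0, len(frames) - max_gap)
        gap = random.randrange(min_gap, max_gap + 1)
        # Earlier frame = observation, later frame = goal,
        # gap = weak label for how far the goal is in time.
        pairs.append((frames[t], frames[t + gap], gap))
    return pairs
```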