AI Summary
This work addresses the limited zero-shot and cross-environment generalization of visual navigation models. To this end, we propose PIG-Nav, a framework built on a pretrained ViT image encoder that features an early-fusion architecture for jointly encoding egocentric observations and goal images, and that incorporates multi-task self-supervised objectives to strengthen cross-environment representation learning. We further introduce a lightweight, automated game-video annotation pipeline to enable large-scale navigation pretraining. Experiments across two simulation platforms and one real-world robot environment show that PIG-Nav improves zero-shot navigation performance by 22.6% on average and, with only minimal fine-tuning data, attains state-of-the-art results that outperform prior methods by 37.5%. The framework substantially improves model transferability and practical applicability for real-world deployment.
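To make the early-fusion idea concrete, below is a minimal PyTorch sketch of a joint observation/goal encoder: patch tokens from both images are concatenated into a single sequence so that every transformer layer can attend across the two views. All names, dimensions, and the token-type embeddings are illustrative assumptions, not PIG-Nav's actual implementation.

```python
import torch
import torch.nn as nn

class EarlyFusionImageGoalEncoder(nn.Module):
    """Hypothetical sketch of an early-fusion observation/goal encoder.

    Layer sizes, token layout, and (absent here) pretrained weights are
    assumptions for illustration; they do not reproduce PIG-Nav.
    """

    def __init__(self, img_size=224, patch_size=16, dim=768, depth=4, heads=12):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Shared patch embedding applied to both the observation and the goal.
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # Learned type embeddings distinguish observation tokens from goal tokens.
        self.obs_type = nn.Parameter(torch.zeros(1, 1, dim))
        self.goal_type = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, depth)

    def tokens(self, img):
        # (B, 3, H, W) -> (B, N, dim) patch tokens with positional embeddings.
        t = self.patchify(img).flatten(2).transpose(1, 2)
        return t + self.pos

    def forward(self, obs, goal):
        # Early fusion: concatenate both token sets into one sequence so every
        # transformer layer attends jointly across observation and goal.
        fused = torch.cat(
            [self.tokens(obs) + self.obs_type, self.tokens(goal) + self.goal_type],
            dim=1,
        )
        return self.transformer(fused)  # (B, 2N, dim) joint representation

if __name__ == "__main__":
    enc = EarlyFusionImageGoalEncoder()
    obs = torch.randn(2, 3, 224, 224)
    goal = torch.randn(2, 3, 224, 224)
    print(enc(obs, goal).shape)  # torch.Size([2, 392, 768])
```

By contrast, a late-fusion design would encode each image separately and merge only the pooled features, which prevents the encoder layers from attending across the two images.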
Abstract
Recent studies have explored pretrained (foundation) models for vision-based robotic navigation, aiming to achieve generalizable navigation and positive transfer across diverse environments while enhancing zero-shot performance in unseen settings. In this work, we introduce PIG-Nav (Pretrained Image-Goal Navigation), a new approach that further investigates pretraining strategies for vision-based navigation models and contributes in two key areas. Model-wise, we identify two critical design choices that consistently improve the performance of pretrained navigation models: (1) integrating an early-fusion network structure that combines visual observations and goal images via an appropriately pretrained Vision Transformer (ViT) image encoder, and (2) introducing suitable auxiliary tasks that enhance global navigation representation learning and thereby improve navigation performance. Dataset-wise, we propose a novel data preprocessing pipeline for efficiently labeling large-scale game video datasets for navigation model training, and we demonstrate that augmenting existing open navigation datasets with diverse gameplay videos improves model performance. Our model achieves an average improvement of 22.6% in zero-shot settings and a 37.5% improvement in fine-tuning settings over existing visual navigation foundation models in two complex simulated environments and one real-world environment. These results advance the state of the art in pretrained image-goal navigation models. Notably, our model maintains competitive performance while requiring significantly less fine-tuning data, highlighting its potential for real-world deployment with minimal labeled supervision.
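As a rough illustration of what a game-video labeling pipeline might look like, the sketch below pairs an earlier video frame (the observation) with a later frame (the goal) and records the temporal offset as a weak distance-to-goal label. The function name, sampling parameters, and label choice are assumptions made for illustration; the paper's actual annotation procedure is not reproduced here.

```python
import random
import cv2  # OpenCV for video decoding

def sample_image_goal_pairs(video_path, num_pairs=100, min_gap=8, max_gap=64, stride=2):
    """Hypothetical sketch: turn a raw gameplay video into navigation
    training tuples (observation frame, goal frame, temporal offset).
    Gap ranges and the offset-as-label heuristic are assumptions.
    """
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    idx = 0
    while ok:
        if idx % stride == 0:  # subsample to reduce near-duplicate frames
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        ok, frame = cap.read()
        idx += 1
    cap.release()

    pairs = []
    for _ in range(num_pairs):
        if len(frames) <= max_gap:
            break  # video too short to sample the requested gap
        t = random.randrange(0, len(frames) - max_gap)
        gap = random.randrange(min_gap, max_gap + 1)
        # Earlier frame = observation, later frame = goal,
        # gap = weak label for how far the goal is in time.
        pairs.append((frames[t], frames[t + gap], gap))
    return pairs
```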