Robust Scene Transfer for PointGoal Navigation via Privileged Sensor Guided Contrastive Learning

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the performance degradation in PointGoal navigation caused by appearance and semantic discrepancies across environments. To mitigate this issue, the authors propose a sensor-guided adaptive contrastive learning framework that uses privileged LiDAR sensing during training only. Through a geometry-aware similarity metric and adaptive temperature scaling, the framework steers the visual encoder toward task-relevant structural features rather than scene-specific visual appearance. The pretrained encoder is then frozen and used as the perception backbone for reinforcement learning, decoupling representation learning from policy optimization. At deployment the agent's only visual input is monocular RGB, alongside standard task inputs such as goal position and proprioceptive signals. In high-fidelity simulation, the method outperforms large pretrained vision models and standard contrastive learning baselines, and it transfers better across diverse indoor and outdoor navigation scenes.
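To make the idea of a LiDAR-guided contrastive objective concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the summary does not specify the similarity metric or the temperature rule, so the exponential RMS-range kernel in `lidar_similarity`, the per-anchor temperature interpolation, and all names and default values (`bandwidth`, `tau_min`, `tau_max`) are assumptions chosen for illustration only.

```python
import numpy as np


def lidar_similarity(scans, bandwidth=1.0):
    """Pairwise geometric similarity in (0, 1] between LiDAR range scans.

    scans: array of shape (N, K), one K-beam range scan per image.
    Assumed kernel: exp(-RMS range difference / bandwidth).
    """
    diff = scans[:, None, :] - scans[None, :, :]
    dist = np.sqrt((diff ** 2).mean(axis=-1))
    return np.exp(-dist / bandwidth)


def sensor_guided_contrastive_loss(embeddings, scans,
                                   tau_min=0.05, tau_max=0.5, bandwidth=1.0):
    """Soft-target contrastive loss guided by LiDAR geometry (illustrative).

    embeddings: array of shape (N, D) from the RGB encoder.
    Targets are the row-normalised geometric similarities, so image pairs
    with similar geometry are pulled together regardless of appearance.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    n = z.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    geo = lidar_similarity(scans, bandwidth)

    # Assumed adaptive rule: an anchor with many geometrically similar
    # neighbours gets a higher (softer) temperature.
    mean_geo = (geo * off_diag).sum(axis=1) / (n - 1)
    tau = tau_min + (tau_max - tau_min) * mean_geo

    # Temperature-scaled cosine similarities, with self-pairs masked out.
    logits = (z @ z.T) / tau[:, None]
    logits = np.where(off_diag, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Soft targets from geometry; the diagonal is excluded.
    targets = np.where(off_diag, geo, 0.0)
    targets = targets / (targets.sum(axis=1, keepdims=True) + 1e-12)

    per_anchor = -(targets * np.where(off_diag, log_prob, 0.0)).sum(axis=1)
    return per_anchor.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_embeddings = rng.normal(size=(16, 32))       # stand-in encoder outputs
    fake_scans = rng.uniform(0.5, 10.0, size=(16, 64))  # stand-in range scans
    print(sensor_guided_contrastive_loss(fake_embeddings, fake_scans))
```

In this sketch LiDAR appears only in the loss, so the encoder itself never consumes range data. That matches the summary's description of LiDAR as a training-time privilege that is unavailable at deployment.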

📝 Abstract

We propose a sensor-guided adaptive contrastive learning framework for visual representation learning in PointGoal navigation. During training, privileged LiDAR sensing guides the contrastive objective through a geometry-aware similarity metric and adaptive temperature scaling, encouraging visual embeddings to capture navigation-relevant structure rather than scene-specific appearance. The resulting encoder is pretrained independently, frozen, and used as the perceptual backbone for reinforcement learning, decoupling representation learning from policy optimization. We further introduce a cross-stage domain mismatch between representation pretraining and policy learning to suppress environment-specific shortcuts and promote reliance on task-relevant features. Extensive experiments in high-fidelity simulation demonstrate that our approach significantly improves policy-level scene transfer across diverse indoor and outdoor environments. At deployment, the agent relies only on monocular RGB observations together with standard task-related inputs such as goal position and proprioceptive signals, without access to LiDAR or other privileged sensors. Our method outperforms large pretrained vision models and standard contrastive baselines under severe appearance and semantic shifts. We also release a multimodal dataset to support future research on privileged-guided visual representation learning for navigation. The code is available at:

Problem

Research questions and friction points this paper is trying to address.

PointGoal navigation

scene transfer

visual representation learning

domain generalization

robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

privileged sensor

contrastive learning

visual representation learning