AI Summary
This work addresses the significant performance degradation of robotic semantic segmentation when deployment environments diverge from the training data distribution, a challenge exacerbated by the sensitivity of existing unsupervised domain adaptation (UDA) methods to cross-view instance-level inconsistencies. To overcome this limitation, the paper is, to the authors' knowledge, the first to bring the zero-shot instance segmentation capability of foundation models into a UDA framework. Starting from 3D voxel maps constructed by the robot, the method generates multi-view consistent pseudo-labels and refines their quality through instance-level consistency constraints, enabling self-supervised fine-tuning without target-domain annotations. Experiments on real-world datasets demonstrate that the proposed approach substantially outperforms state-of-the-art multi-view consistency methods, improving generalization and segmentation accuracy in the target domain.
Abstract
Semantic segmentation networks, which are essential for robotic perception, often suffer from performance degradation when the visual distribution of the deployment environment differs from that of the source dataset on which they were trained. Unsupervised Domain Adaptation (UDA) addresses this challenge by adapting the network to the robot's target environment without external supervision, leveraging the large amounts of data a robot might naturally collect during long-term operation. In such settings, UDA methods can exploit multi-view consistency across the environment's map to fine-tune the model in an unsupervised fashion and mitigate domain shift. However, these approaches remain sensitive to cross-view instance-level inconsistencies. In this work, we propose a method that starts from a volumetric 3D map to generate multi-view consistent pseudo-labels. We then refine these labels using the zero-shot instance segmentation capabilities of a foundation model, enforcing instance-level coherence. The refined annotations serve as supervision for self-supervised fine-tuning, enabling the robot to adapt its perception system at deployment time. Experiments on real-world data demonstrate that our approach consistently improves performance over state-of-the-art UDA baselines based on multi-view consistency, without requiring any ground-truth labels in the target domain.
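The refinement step described above can be illustrated with a minimal sketch. The assumed setup: a per-view semantic pseudo-label map obtained by reprojecting the labeled 3D voxel map into the camera (multi-view consistent, but noisy at object boundaries), plus a set of class-agnostic instance masks from a zero-shot foundation model such as SAM. The helper name `refine_with_instances`, the array-based mask interface, and the majority-vote rule are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def refine_with_instances(pseudo_label, instance_masks, ignore_index=255):
    """Enforce instance-level coherence on a per-view pseudo-label map.

    pseudo_label: (H, W) int array of semantic classes reprojected from a
        labeled 3D voxel map (hypothetical input format).
    instance_masks: list of (H, W) bool arrays, e.g. zero-shot instance
        masks from a foundation model (assumed interface).
    Returns a refined label map in which each instance carries a single
    class, chosen by majority vote over the pixels it covers.
    """
    refined = pseudo_label.copy()
    for mask in instance_masks:
        votes = pseudo_label[mask]
        votes = votes[votes != ignore_index]  # drop unlabeled pixels
        if votes.size == 0:
            continue  # instance has no reprojected label support
        majority = np.bincount(votes).argmax()
        refined[mask] = majority  # one consistent class per instance
    return refined

# Toy example: an instance mask whose interior received mixed classes
# from reprojection is snapped to its majority class.
pl = np.array([[1, 1, 2],
               [1, 2, 2]])
mask = np.array([[True, True, True],
                 [True, True, False]])
out = refine_with_instances(pl, [mask])
```

The refined maps would then serve as targets for the self-supervised fine-tuning described in the abstract; pixels outside any instance mask simply keep their reprojected label.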