🤖 AI Summary
This work addresses domain adaptation for Sim2Real semantic segmentation in the setting where annotated source-domain data (e.g. synthetically generated) is available but the target (real-world) domain has no annotations. It uses the Segment Anything Model (SAM) to obtain segment information on the unannotated real images, and proposes an invariance-variance loss over those segments to regularize feature representations in the target domain. The loss structure and network architecture are designed to handle the overlapping segments and oversegmentation that SAM produces. The method requires no target-domain labels. On the YCB-Video and HomebrewedDB datasets it outperforms prior work, and on YCB-Video it even outperforms a network trained with real annotations. The authors also provide model ablations and show applicability to a custom robotic application.
📝 Abstract
Domain adaptation is especially important for robotics applications, where target domain training data is usually scarce and annotations are costly to obtain. We present a method for self-supervised domain adaptation for the scenario where annotated source domain data (e.g. from synthetic generation) is available, but the target domain data is completely unannotated. Our method targets the semantic segmentation task and leverages a segmentation foundation model (Segment Anything Model) to obtain segment information on unannotated data. We take inspiration from recent advances in unsupervised local feature learning and propose an invariance-variance loss over the detected segments for regularizing feature representations in the target domain. Crucially, this loss structure and network architecture can handle overlapping segments and oversegmentation as produced by Segment Anything. We demonstrate the advantage of our method on the challenging YCB-Video and HomebrewedDB datasets and show that it outperforms prior work and, on YCB-Video, even a network trained with real annotations. Additionally, we provide insight through model ablations and show applicability to a custom robotic application.
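To make the idea of an invariance-variance loss over segments more concrete, here is a minimal NumPy sketch. It is not the authors' formulation, which the abstract does not spell out. It assumes a simple VICReg-style reading: an invariance term that pulls pixel features toward the mean of each segment they fall in, and a hinge-based variance term that keeps the segment means spread out across channels. The function name `invariance_variance_loss` and the parameters `gamma` and `eps` are hypothetical.

```python
import numpy as np


def invariance_variance_loss(features, masks, gamma=1.0, eps=1e-4):
    """Illustrative segment-level regularizer (assumed form, not the paper's).

    features: float array of shape (C, H, W), per-pixel feature vectors.
    masks:    boolean array of shape (K, H, W); masks may overlap, and
              pixels covered by no mask are simply ignored.
    Returns (invariance, variance) as Python floats.
    """
    channels = features.shape[0]
    flat = features.reshape(channels, -1).T  # (H*W, C)

    means, within = [], []
    for mask in masks:
        pixels = flat[mask.reshape(-1)]  # features of pixels inside this segment
        if pixels.shape[0] == 0:
            continue  # skip empty masks
        mu = pixels.mean(axis=0)
        means.append(mu)
        # Mean squared distance of the segment's pixels to the segment mean.
        within.append(((pixels - mu) ** 2).sum(axis=1).mean())

    if not means:
        return 0.0, 0.0

    # Invariance: features inside one segment should agree with each other.
    invariance = float(np.mean(within))

    # Variance: per-channel spread of segment means should stay above gamma,
    # which penalizes the collapsed solution where all features are equal.
    std = np.sqrt(np.stack(means).var(axis=0) + eps)
    variance = float(np.mean(np.maximum(0.0, gamma - std)))
    return invariance, variance
```

Because each mask is processed independently, a pixel that lies in several overlapping masks contributes to each of them, and an oversegmented object just yields several segment means. In training, the two terms would typically be weighted and added to the supervised segmentation loss on the source domain; the weights are a design choice this sketch leaves open.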