🤖 AI Summary
To address the challenge of zero-shot recognition and segmentation of unseen objects by service robots in unstructured indoor environments, this paper proposes a three-stage cascaded framework that combines SAM’s zero-shot mask generation capability with explicit visual representations from a self-supervised ViT. The method requires neither category annotations nor real-scene training data. It first generates object-agnostic mask proposals from colorized depth images with SAM, then filters out non-object masks using attention-based features from the self-supervised ViT, and finally applies K-Medoids clustering to generate point prompts that guide SAM toward precise object masks, which avoids the simulation-to-reality gap that affects models trained on synthetic data. Evaluated on two benchmark datasets and a self-collected dataset covering complex indoor settings such as cabinets, drawers, and handheld objects, the approach shows superior zero-shot instance segmentation performance. The result is an annotation-free approach aimed at real-world service robotics.
📝 Abstract
Service robots operating in unstructured environments must effectively recognize and segment unknown objects to enhance their functionality. Traditional supervised learning-based segmentation techniques require extensive annotated datasets, which are impractical for the diversity of objects encountered in real-world scenarios. Unseen Object Instance Segmentation (UOIS) methods aim to address this by training models on synthetic data to generalize to novel objects, but they often suffer from the simulation-to-reality gap. This paper proposes a novel approach (ZISVFM) for solving UOIS by leveraging the powerful zero-shot capability of the Segment Anything Model (SAM) and explicit visual representations from a self-supervised vision transformer (ViT). The proposed framework operates in three stages: (1) generating object-agnostic mask proposals from colorized depth images using SAM, (2) refining these proposals using attention-based features from the self-supervised ViT to filter non-object masks, and (3) applying K-Medoids clustering to generate point prompts that guide SAM towards precise object segmentation. Experimental validation on two benchmark datasets and a self-collected dataset demonstrates the superior performance of ZISVFM in complex environments, including hierarchical settings such as cabinets, drawers, and handheld objects. Our source code is available at https://github.com/Yinmlmaoliang/zisvfm.
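Stage (3) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes, for simplicity, that K-Medoids is run on the pixel coordinates of a single binary mask rather than on ViT feature embeddings, and the names `kmedoids`, `point_prompts_from_mask`, and `max_points` are hypothetical. It only shows how medoids could be turned into point prompts.

```python
import numpy as np


def kmedoids(points, k, n_iter=50, seed=0):
    """Alternating K-Medoids; returns the row indices of the k medoids."""
    n = len(points)
    if k > n:
        raise ValueError("k must not exceed the number of points")
    rng = np.random.default_rng(seed)
    # Full pairwise Euclidean distance matrix (n x n); acceptable for small n.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the previous medoid for an empty cluster
            # Update step: pick the member with the smallest total
            # distance to all other members of the same cluster.
            costs = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break  # converged
        medoids = new_medoids
    return medoids


def point_prompts_from_mask(mask, k=3, max_points=500, seed=0):
    """Return k (x, y) pixel coordinates inside `mask` to use as point prompts.

    Assumption: clustering is done on pixel coordinates only; the paper
    describes clustering in connection with ViT features.
    """
    ys, xs = np.nonzero(mask)
    coords = np.stack([xs, ys], axis=1).astype(float)
    # Subsample large masks so the distance matrix stays small.
    if len(coords) > max_points:
        rng = np.random.default_rng(seed)
        keep = rng.choice(len(coords), size=max_points, replace=False)
        coords = coords[keep]
    idx = kmedoids(coords, k, seed=seed)
    return coords[idx].astype(int)


if __name__ == "__main__":
    # Toy mask made of two rectangular regions.
    demo = np.zeros((20, 30), dtype=bool)
    demo[2:8, 3:10] = True
    demo[12:18, 20:28] = True
    print(point_prompts_from_mask(demo, k=2))
```

One reason medoids suit this step is that a medoid is always one of the input points, so every prompt lies on a mask pixel, whereas a k-means centroid can fall outside a concave or fragmented mask. In the `segment-anything` package, an array of this shape could be passed as `point_coords` to `SamPredictor.predict`, together with a matching `point_labels` array of ones for foreground points.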