Adapting Pre-Trained Vision Models for Novel Instance Detection and Segmentation

📅 2024-05-28
🏛️ arXiv.org
🤖 AI Summary
This paper addresses few-shot novel instance detection and segmentation. We propose NIDS-Net, a unified framework that jointly leverages Grounding DINO for high-quality candidate box generation and SAM for mask proposal. Instance embeddings are constructed by averaging patch-level features from DINOv2's ViT backbone, and a novel lightweight learnable weight adapter is introduced to locally refine the embedding space, mitigating overfitting and enhancing inter-class discriminability. Furthermore, we formulate an end-to-end trainable template-proposal embedding matching paradigm. Experiments demonstrate state-of-the-art performance on four detection benchmarks. On seven RGB-only segmentation datasets from the BOP challenge, NIDS-Net outperforms all existing pure RGB methods and matches the accuracy of top-performing RGB-D approaches. Real-world validation is conducted on a Fetch robot platform and with a RealSense camera, confirming robustness in practical deployment scenarios.
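The summary above describes building each instance embedding by averaging the foreground patch-level features of a ViT backbone. A minimal sketch of that pooling step, assuming the backbone's patch features have already been reshaped into an (H, W, D) grid and the mask downsampled to the same grid (the function name and shapes here are illustrative, not the paper's exact interface):

```python
import numpy as np

def instance_embedding(patch_feats, fg_mask):
    """Average the patch features that fall inside the foreground mask.

    patch_feats: (H, W, D) grid of patch embeddings (e.g. from a ViT backbone).
    fg_mask:     (H, W) boolean foreground mask at patch-grid resolution.
    Returns an L2-normalized (D,) instance embedding.
    """
    feats = patch_feats[fg_mask]                      # (N_fg, D) foreground patches
    emb = feats.mean(axis=0)                          # average over foreground
    return emb / (np.linalg.norm(emb) + 1e-8)         # unit-normalize for cosine matching
```

Normalizing the pooled vector makes the later template-proposal comparison a plain dot product (cosine similarity).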

📝 Abstract
Novel Instance Detection and Segmentation (NIDS) aims at detecting and segmenting novel object instances given a few examples of each instance. We propose a unified, simple, yet effective framework (NIDS-Net) comprising object proposal generation, embedding creation for both instance templates and proposal regions, and embedding matching for instance label assignment. Leveraging recent advancements in large vision methods, we utilize Grounding DINO and Segment Anything Model (SAM) to obtain object proposals with accurate bounding boxes and masks. Central to our approach is the generation of high-quality instance embeddings. We utilize foreground feature averages of patch embeddings from the DINOv2 ViT backbone, followed by refinement through a weight adapter mechanism that we introduce. We show experimentally that our weight adapter can adjust the embeddings locally within their feature space and effectively limit overfitting in the few-shot setting. Furthermore, the weight adapter optimizes weights to enhance the distinctiveness of instance embeddings during similarity computation. This methodology enables a straightforward matching strategy that results in significant performance gains. Our framework surpasses current state-of-the-art methods, demonstrating notable improvements on four detection datasets. In the segmentation tasks on seven core datasets of the BOP challenge, our method outperforms the leading published RGB methods and remains competitive with the best RGB-D method. We have also verified our method using real-world images from a Fetch robot and a RealSense camera. Project Page: https://irvlutd.github.io/NIDSNet/
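The abstract describes a weight adapter that adjusts embeddings locally within their feature space; its exact architecture is the paper's contribution and is not reproduced here. As a rough illustration only, one plausible form is a residual elementwise re-weighting, where zero-initialized weights start as the identity map so adapted embeddings stay close to the originals:

```python
import numpy as np

class WeightAdapter:
    """Hypothetical sketch of a lightweight embedding adapter.

    A single learnable weight vector re-scales each embedding dimension
    through a residual connection. With zero-initialized weights the
    adapter is the identity, so refinement stays local to the original
    embedding space. (Illustrative only; not the paper's exact design.)
    """

    def __init__(self, dim):
        self.w = np.zeros(dim)  # learned end-to-end in practice; zero-init = identity

    def __call__(self, emb):
        adapted = emb * (1.0 + self.w)  # residual elementwise re-weighting
        return adapted / (np.linalg.norm(adapted, axis=-1, keepdims=True) + 1e-8)
```

The residual form is one common way to keep a few-shot adapter from drifting far from the pre-trained feature space, which matches the overfitting-limiting behavior the abstract describes.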
Problem

Research questions and friction points this paper is trying to address.

Detect and segment novel object instances from only a few examples per instance.
Generate high-quality, discriminative instance embeddings from the DINOv2 ViT backbone.
Avoid overfitting when refining embeddings in the few-shot setting.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Grounding DINO and SAM to generate object proposals with accurate boxes and masks
Introduces a lightweight weight adapter that refines instance embeddings locally and limits few-shot overfitting
Achieves state-of-the-art detection results and leading RGB-only segmentation on the BOP benchmarks
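The matching strategy the paper calls "straightforward" can be sketched as nearest-template assignment under cosine similarity: with unit-normalized embeddings, each proposal takes the label of its most similar template (function and variable names here are illustrative):

```python
import numpy as np

def match_proposals(proposal_embs, template_embs):
    """Assign each proposal the index of its most similar instance template.

    proposal_embs: (P, D) L2-normalized proposal embeddings.
    template_embs: (T, D) L2-normalized template embeddings.
    Returns (indices, scores): best-matching template index and its
    cosine-similarity score for each proposal.
    """
    sims = proposal_embs @ template_embs.T  # dot product of unit vectors = cosine similarity
    return sims.argmax(axis=1), sims.max(axis=1)
```

In practice the scores would also be thresholded to reject proposals that match no known instance; the threshold choice is deployment-specific and not shown here.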
Authors
Ya Lu, Department of Computer Science, University of Texas at Dallas
Jishnu Jaykumar, Department of Computer Science, University of Texas at Dallas
Yunhui Guo, UT Dallas (computer vision, machine learning, edge computing)
Nicholas Ruozzi, University of Texas at Dallas (graphical models, optimization, machine learning)
Yu Xiang, Department of Computer Science, University of Texas at Dallas