🤖 AI Summary
Accurate 6-DoF pose estimation of unknown surgical instruments in robot-assisted minimally invasive surgery (RMIS) remains challenging: marker-based approaches suffer from occlusion and specular reflection, while supervised learning methods generalise poorly and require extensive annotated data. Method: This work introduces zero-shot RGB-D pose estimation to RMIS for the first time. We improve the robustness of depth estimation in reflective, textureless scenes via RAFT-Stereo, and replace SAM-6D's instance segmentation module, SAM, with a fine-tuned Mask R-CNN for more precise instrument segmentation under occlusion. Crucially, the method requires no training data for target instruments and generalises zero-shot across instruments. Contribution/Results: On unseen instruments, our enhanced SAM-6D significantly outperforms FoundationPose. We establish the first zero-shot RGB-D pose estimation benchmark for RMIS, enabling a new paradigm for surgical navigation and autonomous control with high generalisability and minimal data dependency.
📝 Abstract
Accurate pose estimation of surgical tools in Robot-assisted Minimally Invasive Surgery (RMIS) is essential for surgical navigation and robot control. While traditional marker-based methods offer accuracy, they struggle with occlusions, specular reflections, and tool-specific designs. Supervised learning methods, in turn, require extensive training on annotated datasets, limiting their adaptability to new tools. Despite their success in other domains, zero-shot pose estimation models remain unexplored for surgical instruments in RMIS, leaving a gap in generalising to unseen surgical tools. This paper presents a novel 6 Degrees of Freedom (DoF) pose estimation pipeline for surgical instruments that leverages state-of-the-art zero-shot RGB-D models, FoundationPose and SAM-6D. We advance these models by incorporating vision-based depth estimation with RAFT-Stereo, yielding robust depth in reflective and textureless environments. Additionally, we enhance SAM-6D by replacing its instance segmentation module, the Segment Anything Model (SAM), with a fine-tuned Mask R-CNN, significantly boosting segmentation accuracy under occlusion and in complex scenes. Extensive validation shows that our enhanced SAM-6D surpasses FoundationPose in zero-shot pose estimation of unseen surgical instruments, setting a new benchmark for zero-shot RGB-D pose estimation in RMIS. This work improves the generalisability of pose estimation to unseen objects and pioneers the application of RGB-D zero-shot methods in RMIS.
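The depth step of the pipeline relies on a stereo matcher (here RAFT-Stereo) producing a disparity map, which is then converted to metric depth before being fed to the RGB-D pose estimator. The conversion itself is standard rectified-pinhole stereo geometry, depth = f·B/d. The sketch below illustrates only that conversion; the function name, the toy disparity values, and the focal length/baseline numbers are illustrative assumptions, not the paper's calibration or its RAFT-Stereo output.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Convert a stereo disparity map (pixels) to metric depth (metres).

    Assumes a rectified pinhole stereo pair: depth = f * B / d, where
    f is the focal length in pixels and B the baseline in metres.
    Pixels with (near-)zero disparity have unbounded depth and are
    marked invalid (NaN) instead of dividing by zero.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(disparity, np.nan)
    valid = disparity > eps
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Toy 2x2 disparity map (pixels); f and B are illustrative values
# (e.g. a narrow-baseline stereo endoscope), not the paper's setup.
disp = np.array([[50.0, 25.0],
                 [0.0, 100.0]])
depth = disparity_to_depth(disp, focal_px=1000.0, baseline_m=0.005)
# e.g. a 50 px disparity maps to 1000 * 0.005 / 50 = 0.1 m
```

Masking zero-disparity pixels as NaN (rather than clamping them) lets the downstream pose model's depth consumer decide how to treat missing measurements, which matters on the specular, textureless regions the paper targets.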