🤖 AI Summary
Current surgical video segmentation methods rely on manual initialization, which hinders real-time clinical deployment. To address this, we propose a cross-patient frame initialization paradigm that, for the first time, uses annotated frames from *other* patients as zero-shot initialization sources, eliminating dependence on target-patient annotations. Our method builds on a video object segmentation framework and integrates three key components: cross-patient feature transfer, frame-wise similarity assessment, and robust spatio-temporal alignment, enabling fully automatic tracking initialization without human input. Evaluated across multiple surgical video datasets under zero-shot settings, our approach achieves state-of-the-art performance (improving the mean J&F score by 2.1%), substantially reduces the frequency of manual intervention, and demonstrates strong feasibility for clinical integration.
📝 Abstract
Video object segmentation is an emerging technology well-suited to real-time surgical video analysis, offering valuable clinical assistance in the operating room by tracking objects consistently across frames. However, its adoption is limited by the need for manual intervention to select the tracked object, which is impractical in surgical settings. In this work, we tackle this challenge with an innovative solution: using previously annotated frames from other patients as the tracking frames. We find that this unconventional approach can match or even surpass the performance of using patients' own tracking frames, enabling more autonomous and efficient AI-assisted surgical workflows. Furthermore, we analyze the benefits and limitations of this approach, highlighting its potential to enhance segmentation accuracy while reducing the need for manual input. Our findings provide insights into key factors influencing performance, offering a foundation for future research on optimizing cross-patient frame selection for real-time surgical video analysis.
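The frame-wise similarity assessment at the heart of cross-patient initialization can be sketched as a nearest-neighbor lookup over frame embeddings: given the target patient's first frame, pick the most similar annotated frame from a bank built from other patients and use its annotation to seed the tracker. This is a minimal illustration under assumed inputs; the function name, the embedding dimensionality, and the cosine-similarity criterion are illustrative choices, not the paper's actual implementation.

```python
import numpy as np

def select_init_frame(target_feat, bank_feats):
    """Return (index, score) of the cross-patient annotated frame whose
    embedding is most cosine-similar to the target's first frame.

    target_feat: (d,) embedding of the target patient's first frame
    bank_feats:  (n, d) embeddings of annotated frames from other patients
    (Embeddings would come from a pretrained encoder; here they are toy data.)
    """
    t = target_feat / np.linalg.norm(target_feat)
    b = bank_feats / np.linalg.norm(bank_feats, axis=1, keepdims=True)
    sims = b @ t                      # cosine similarity per bank frame
    idx = int(np.argmax(sims))        # best-matching cross-patient frame
    return idx, float(sims[idx])

# Toy demo: three annotated frames from other patients, 4-dim embeddings.
bank = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.7, 0.7, 0.0, 0.0]])
target = np.array([0.9, 0.1, 0.0, 0.0])
best, score = select_init_frame(target, bank)  # frame 0 matches best here
```

The selected frame's annotation would then replace the manual first-frame mask normally required by a video object segmentation tracker.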