π€ AI Summary
Existing surgical video object segmentation methods suffer from significant bottlenecks in real-time performance and long-term tracking robustness, failing to meet clinical interactive requirements. To address this, we propose a two-stage text-guided segmentation framework: (1) semantic-precise detection via SAM2 in the first stage; and (2) online adaptive tracking in the second stage, incorporating a confidence-aware initial frame selection strategy and a diversity-driven long-term memory mechanism, jointly optimized with a cross-modal spatiotemporal Mamba network. To our knowledge, this is the first method enabling real-time, temporally consistent, text-driven segmentation for surgical videos. It achieves an average 8.7% IoU improvement across multiple benchmarks, enhances long-term tracking stability by 42%, and attains an inference speed of 61.2 FPSβthereby overcoming the limitations of short-term tracking and computational inefficiency.
π Abstract
Surgical scene segmentation is critical in computer-assisted surgery and is vital for enhancing surgical quality and patient outcomes. Recently, referring surgical segmentation is emerging, given its advantage of providing surgeons with an interactive experience to segment the target object. However, existing methods are limited by low efficiency and short-term tracking, hindering their applicability in complex real-world surgical scenarios. In this paper, we introduce ReSurgSAM2, a two-stage surgical referring segmentation framework that leverages Segment Anything Model 2 to perform text-referred target detection, followed by tracking with reliable initial frame identification and diversity-driven long-term memory. For the detection stage, we propose a cross-modal spatial-temporal Mamba to generate precise detection and segmentation results. Based on these results, our credible initial frame selection strategy identifies the reliable frame for the subsequent tracking. Upon selecting the initial frame, our method transitions to the tracking stage, where it incorporates a diversity-driven memory mechanism that maintains a credible and diverse memory bank, ensuring consistent long-term tracking. Extensive experiments demonstrate that ReSurgSAM2 achieves substantial improvements in accuracy and efficiency compared to existing methods, operating in real-time at 61.2 FPS. Our code and datasets will be available at https://github.com/jinlab-imvr/ReSurgSAM2.