TSMS-SAM2: Multi-scale Temporal Sampling Augmentation and Memory-Splitting Pruning for Promptable Video Object Segmentation and Tracking in Surgical Scenarios

📅 2025-08-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

SAM2 loses robustness and efficiency in video object segmentation and tracking (VOST) on surgical videos because of complex object motion and redundant historical features. To address this, the paper proposes TSMS-SAM2, which adds two components: (1) a multi-scale temporal sampling augmentation that improves robustness to rapid and variable motion; and (2) a memory splitting and pruning mechanism that explicitly organizes and filters features from past frames to reduce memory redundancy. The framework keeps SAM2's prompt-driven paradigm while targeting more efficient and accurate segmentation. On EndoVis2017 and EndoVis2018, TSMS-SAM2 achieves mean Dice scores of 95.24% and 86.73%, respectively, the highest among the SAM-based and task-specific methods it is compared against. Ablation studies support the contribution of both components in dynamic minimally invasive surgical scenarios.
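For readers unfamiliar with the metric quoted above, the Dice score between a predicted and a ground-truth binary mask is 2|A ∩ B| / (|A| + |B|). The following is a minimal pure-Python sketch of that standard definition. The function names `dice_score` and `mean_dice` are illustrative, and the way the paper averages scores (per frame, per class, or per video) is not stated on this page, so the simple average here is an assumption.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    # Total foreground pixels across both masks.
    total = sum(1 for p in flat_pred if p) + sum(1 for t in flat_target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks count as a perfect match
    return 2.0 * intersection / total


def mean_dice(pairs):
    """Plain average of Dice scores over (pred, target) pairs (an assumed averaging scheme)."""
    scores = [dice_score(p, t) for p, t in pairs]
    return sum(scores) / len(scores)
```

A reported value such as 95.24% corresponds to a mean score of 0.9524 on this 0 to 1 scale.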

📝 Abstract

Promptable video object segmentation and tracking (VOST) has seen significant advances with the emergence of foundation models like Segment Anything Model 2 (SAM2); however, their application in surgical video analysis remains challenging due to complex motion dynamics and the redundancy of memory that impedes effective learning. In this work, we propose TSMS-SAM2, a novel framework that enhances promptable VOST in surgical videos by addressing challenges of rapid object motion and memory redundancy in SAM2. TSMS-SAM2 introduces two key strategies: multi-temporal-scale video sampling augmentation to improve robustness against motion variability, and a memory splitting and pruning mechanism that organizes and filters past frame features for more efficient and accurate segmentation. Evaluated on EndoVis2017 and EndoVis2018 datasets, TSMS-SAM2 achieved the highest mean Dice scores of 95.24 and 86.73, respectively, outperforming prior SAM-based and task-specific methods. Extensive ablation studies confirm the effectiveness of multiscale temporal augmentation and memory splitting, highlighting the framework's potential for robust, efficient segmentation in complex surgical scenarios. Our source code will be available at https://github.com/apple1986/TSMS-SAM2.
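One plausible reading of "multi-temporal-scale video sampling augmentation" is that training clips are drawn at several frame strides, so the same object appears to move at different speeds. The sketch below illustrates that idea only. The function name `sample_clip_indices`, the stride set `(1, 2, 4)`, and the uniform choice among strides are all assumptions, not details taken from the paper.

```python
import random


def sample_clip_indices(num_frames, clip_len, strides=(1, 2, 4), rng=None):
    """Pick frame indices for one training clip at a randomly chosen temporal stride.

    Hypothetical helper: a larger stride skips more frames, which simulates faster motion.
    Returns (indices, stride).
    """
    rng = rng or random.Random()
    # Keep only strides for which a full clip fits inside the video.
    feasible = [s for s in strides if (clip_len - 1) * s < num_frames]
    if not feasible:
        raise ValueError("video too short for the requested clip length")
    stride = rng.choice(feasible)
    span = (clip_len - 1) * stride          # distance from first to last sampled frame
    start = rng.randint(0, num_frames - 1 - span)  # randint is inclusive on both ends
    return [start + i * stride for i in range(clip_len)], stride
```

For example, with a 10-frame video and 4-frame clips, only strides 1 and 2 are feasible, so a call returns four indices spaced 1 or 2 frames apart.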

Problem

Research questions and friction points this paper is trying to address.

Enhance promptable VOST in surgical videos

Address rapid object motion challenges

Reduce memory redundancy in SAM2

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-temporal-scale video sampling augmentation

Memory splitting and pruning mechanism

Enhanced promptable VOST in surgical videos
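The page describes the memory mechanism only as one that "organizes and filters past frame features". As a rough illustration of what splitting and pruning a frame memory could look like, the sketch below keeps the most recent frames as a short-term group and prunes older frames whose features are near-duplicates of ones already kept. Everything here is hypothetical: the name `split_and_prune`, the short-term/long-term split, the cosine-similarity criterion, and the default thresholds are assumptions for illustration and should not be read as the authors' algorithm.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def split_and_prune(memory, num_recent=2, max_long=4, sim_threshold=0.95):
    """Split a memory bank into short-term and long-term parts, pruning redundancy.

    memory: list of (frame_idx, feature_vector) tuples, oldest first.
    Returns (short_term, long_term), both oldest first.
    """
    # Short-term group: the most recent frames are always kept.
    short_term = memory[-num_recent:] if num_recent > 0 else []
    older = memory[:-num_recent] if num_recent > 0 else list(memory)

    # Long-term group: walk older frames from newest to oldest and keep a frame
    # only if it is not too similar to any frame already kept.
    long_term = []
    for frame_idx, feat in reversed(older):
        if len(long_term) >= max_long:
            break
        if all(cosine(feat, kept) < sim_threshold for _, kept in long_term):
            long_term.append((frame_idx, feat))
    long_term.reverse()  # restore chronological order
    return short_term, long_term
```

In this toy version, a smaller `sim_threshold` prunes more aggressively, trading memory size against the diversity of stored appearances.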

🔎 Similar Papers

Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning