TSMS-SAM2: Multi-scale Temporal Sampling Augmentation and Memory-Splitting Pruning for Promptable Video Object Segmentation and Tracking in Surgical Scenarios

📅 2025-08-07
🤖 AI Summary
SAM2 loses robustness and efficiency on video object segmentation and tracking (VOST) in surgical videos because of complex object motion and redundant historical features. To address this, the paper proposes TSMS-SAM2, which adds two components: (1) a multi-scale temporal sampling augmentation that helps the model handle rapid motion dynamics; and (2) a memory splitting and dynamic pruning mechanism that explicitly organizes and compresses historical frame features to reduce memory redundancy. The framework preserves SAM2's prompt-driven paradigm while improving temporal consistency and segmentation accuracy. On EndoVis2017 and EndoVis2018, TSMS-SAM2 achieves mean Dice scores of 95.24% and 86.73%, respectively, surpassing prior SAM-based and task-specific VOST methods. The authors present these results as state of the art for dynamic minimally invasive surgical scenarios.
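
For readers unfamiliar with the metric behind these numbers, the Dice score of a predicted mask P against a ground-truth mask T is 2|P ∩ T| / (|P| + |T|). The following is a minimal sketch of that standard formula on flat binary masks. The function name `dice_score` and the empty-mask convention are assumptions made for illustration; how the paper averages scores across frames, classes, or videos is not stated on this page.

```python
def dice_score(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for two flat binary masks (lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground pixels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(f"Dice: {dice_score(pred, target):.4f}")
```

Here the masks share 2 foreground pixels and contain 3 foreground pixels each, so the score is 2·2 / (3 + 3) ≈ 0.667.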

📝 Abstract
Promptable video object segmentation and tracking (VOST) has seen significant advances with the emergence of foundation models like Segment Anything Model 2 (SAM2); however, their application in surgical video analysis remains challenging due to complex motion dynamics and the redundancy of memory that impedes effective learning. In this work, we propose TSMS-SAM2, a novel framework that enhances promptable VOST in surgical videos by addressing challenges of rapid object motion and memory redundancy in SAM2. TSMS-SAM2 introduces two key strategies: multi-temporal-scale video sampling augmentation to improve robustness against motion variability, and a memory splitting and pruning mechanism that organizes and filters past frame features for more efficient and accurate segmentation. Evaluated on EndoVis2017 and EndoVis2018 datasets, TSMS-SAM2 achieved the highest mean Dice scores of 95.24 and 86.73, respectively, outperforming prior SAM-based and task-specific methods. Extensive ablation studies confirm the effectiveness of multiscale temporal augmentation and memory splitting, highlighting the framework's potential for robust, efficient segmentation in complex surgical scenarios. Our source code will be available at https://github.com/apple1986/TSMS-SAM2.
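
One plausible reading of multi-temporal-scale sampling is to draw training clips at several frame strides, so the same tool motion appears at different apparent speeds. The sketch below illustrates that idea only. The function name `sample_clip_indices`, the stride set `(1, 2, 4)`, and the uniform choice among strides are all assumptions, not details taken from the paper.

```python
import random


def sample_clip_indices(num_frames, clip_len, strides=(1, 2, 4), rng=None):
    """Return frame indices for one training clip at a randomly chosen stride.

    Hypothetical illustration: the stride set and uniform sampling are assumed.
    """
    if clip_len < 1:
        raise ValueError("clip_len must be at least 1")
    rng = rng or random
    # Keep only strides for which a full clip fits inside the video.
    feasible = [s for s in strides if (clip_len - 1) * s < num_frames]
    if not feasible:
        raise ValueError("video too short for the requested clip length")
    stride = rng.choice(feasible)
    span = (clip_len - 1) * stride
    # randint is inclusive on both ends, so the last index stays in range.
    start = rng.randint(0, num_frames - 1 - span)
    return [start + i * stride for i in range(clip_len)]


rng = random.Random(0)
clip = sample_clip_indices(num_frames=100, clip_len=8, rng=rng)
print(clip)
```

A larger stride makes the object move further between consecutive sampled frames, which is one way to simulate faster motion without new data.
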
Problem

Research questions and friction points this paper is trying to address.

Enhance promptable VOST in surgical videos
Address rapid object motion challenges
Reduce memory redundancy in SAM2
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-temporal-scale video sampling augmentation
Memory splitting and pruning mechanism
Enhanced promptable VOST in surgical videos
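
The page does not describe how memory splitting and pruning are implemented, so the following is only a rough sketch of one way such a mechanism could work: split the memory bank into a short-term part (the most recent frames, always kept) and a long-term part (older frames), then drop long-term entries that are near-duplicates of ones already kept. The names `split_and_prune`, `num_recent`, `sim_threshold`, and `max_long_term`, and the use of cosine similarity, are hypothetical choices for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def split_and_prune(memory, num_recent=2, sim_threshold=0.95, max_long_term=4):
    """memory: list of (frame_idx, feature_vector), oldest first.

    Hypothetical scheme: returns (long_term, short_term) after pruning.
    """
    if num_recent > 0:
        short_term = memory[-num_recent:]
        older = memory[:-num_recent]
    else:
        short_term, older = [], list(memory)
    long_term = []
    # Walk newest to oldest so the newer of two near-duplicates survives.
    for idx, feat in reversed(older):
        redundant = any(cosine(feat, kept) >= sim_threshold for _, kept in long_term)
        if not redundant and len(long_term) < max_long_term:
            long_term.append((idx, feat))
    long_term.reverse()  # restore chronological order
    return long_term, short_term


memory = [
    (0, [1.0, 0.0]),
    (1, [1.0, 0.01]),  # nearly identical to frame 0
    (2, [0.0, 1.0]),
    (3, [0.5, 0.5]),
    (4, [0.4, 0.6]),
]
long_term, short_term = split_and_prune(memory)
print([i for i, _ in long_term], [i for i, _ in short_term])
```

In this toy example frame 0 is pruned because frame 1 carries almost the same feature, while the two most recent frames are kept unconditionally.
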
Guoping Xu
UTSW, WIT
Medical Image Segmentation, Disease Quantification, Computer Vision
Hua-Chieh Shao
The Medical Artificial Intelligence and Automation (MAIA) Laboratory, Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
You Zhang
The Medical Artificial Intelligence and Automation (MAIA) Laboratory, Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA