TSMS-SAM2: Multi-scale Temporal Sampling Augmentation and Memory-Splitting Pruning for Promptable Video Object Segmentation and Tracking in Surgical Scenarios
By: Guoping Xu, Hua-Chieh Shao, You Zhang
Potential Business Impact:
Helps robots see and follow moving surgical tools during operations.
Promptable video object segmentation and tracking (VOST) has seen significant advances with the emergence of foundation models like Segment Anything Model 2 (SAM2); however, applying these models to surgical video analysis remains challenging due to complex motion dynamics and memory redundancy, which impedes effective learning. In this work, we propose TSMS-SAM2, a novel framework that enhances promptable VOST in surgical videos by addressing the challenges of rapid object motion and memory redundancy in SAM2. TSMS-SAM2 introduces two key strategies: a multi-temporal-scale video sampling augmentation that improves robustness to motion variability, and a memory splitting and pruning mechanism that organizes and filters past-frame features for more efficient and accurate segmentation. Evaluated on the EndoVis2017 and EndoVis2018 datasets, TSMS-SAM2 achieved the highest mean Dice scores of 95.24 and 86.73, respectively, outperforming prior SAM-based and task-specific methods. Extensive ablation studies confirm the effectiveness of multiscale temporal augmentation and memory splitting, highlighting the framework's potential for robust, efficient segmentation in complex surgical scenarios. Our source code will be available at https://github.com/apple1986/TSMS-SAM2.
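The abstract describes the two strategies only at a high level. The sketch below is a hypothetical illustration, not the authors' implementation: `multiscale_sample` shows one plausible way to draw training clips at several temporal strides (the "multi-temporal-scale sampling" idea), and `prune_memory` shows a generic top-k filter over past-frame memories (standing in for the memory splitting and pruning mechanism). All function names, the stride set, and the scoring criterion are assumptions.

```python
def multiscale_sample(num_frames, clip_len, strides=(1, 2, 4)):
    """Return one clip of frame indices per temporal stride.

    Coarser strides simulate faster apparent motion, which is one way a
    sampler could expose the model to motion variability (assumed scheme).
    """
    clips = []
    for s in strides:
        span = (clip_len - 1) * s + 1
        if span > num_frames:
            continue  # stride too coarse for this video
        start = 0  # a real augmentation would randomize the start offset
        clips.append(list(range(start, start + span, s)))
    return clips


def prune_memory(memories, scores, keep=4):
    """Keep the `keep` highest-scoring past-frame memories (hypothetical
    relevance criterion), then restore their temporal order."""
    order = sorted(range(len(memories)), key=lambda i: scores[i], reverse=True)
    kept = sorted(order[:keep])
    return [memories[i] for i in kept]
```

For example, `multiscale_sample(100, 8)` yields three 8-frame clips covering 8, 15, and 29 frames of the same video, and `prune_memory` would shrink SAM2-style memory banks that otherwise grow with every processed frame.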
Similar Papers
Memory-Augmented SAM2 for Training-Free Surgical Video Segmentation
CV and Pattern Recognition
Helps robots see and track tools in surgery.
Evaluating SAM2 for Video Semantic Segmentation
CV and Pattern Recognition
Tests how well computers can cut out objects in videos.
Fast SAM2 with Text-Driven Token Pruning
CV and Pattern Recognition
Makes videos easier to edit by focusing on important parts.