HQ-SMem: Video Segmentation and Tracking Using Memory Efficient Object Embedding With Selective Update and Self-Supervised Distillation Feedback

📅 2025-07-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video object segmentation (VOS) methods suffer from limited accuracy and robustness when handling deformable or topologically changing objects, tracking drift, and long video sequences. To address these challenges, we propose an efficient and robust VOS framework comprising three key components: (1) a high-precision mask refinement mechanism built upon SAM-HQ to enhance initial segmentation quality; (2) a dynamic, intelligent memory strategy with selective keyframe storage for lightweight temporal modeling; and (3) online appearance model updating via self-supervised distillation with feedback-driven refinement, mitigating background confusion and drift. Our method consistently ranks among the top two on the VOTS and VOTSt 2024 benchmarks and establishes new state-of-the-art performance on LVOS and the Long Video Dataset. It significantly improves segmentation accuracy and long-range consistency, particularly in multi-object, complex dynamic scenes.

📝 Abstract
Video Object Segmentation (VOS) is foundational to numerous computer vision applications, including surveillance, autonomous driving, robotics, and generative video editing. However, existing VOS models often struggle with precise mask delineation, deformable objects, topologically transforming objects, tracking drift, and long video sequences. In this paper, we introduce HQ-SMem (High-Quality video segmentation and tracking using Smart Memory), a novel method that enhances the performance of VOS base models by addressing these limitations. Our approach incorporates three key innovations: (i) leveraging SAM with High-Quality masks (SAM-HQ) alongside appearance-based candidate selection to refine coarse segmentation masks, resulting in improved object boundaries; (ii) implementing a dynamic smart memory mechanism that selectively stores relevant key frames while discarding redundant ones, thereby optimizing memory usage and processing efficiency for long-term videos; and (iii) dynamically updating the appearance model to effectively handle complex topological object variations and reduce drift throughout the video. These contributions mitigate several limitations of existing VOS models, including coarse segmentations that mix in background pixels, fixed memory update schedules, brittleness to drift and occlusions, and prompt ambiguity issues associated with SAM. Extensive experiments conducted on multiple public datasets and state-of-the-art base trackers demonstrate that our method consistently ranks among the top two on the VOTS and VOTSt 2024 datasets. Moreover, HQ-SMem sets new benchmarks on the Long Video Dataset and LVOS, showcasing its effectiveness in challenging scenarios characterized by complex multi-object dynamics over extended temporal durations.
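The selective key-frame storage described in innovation (ii) can be sketched as a novelty-gated memory: a frame embedding is stored only if it is sufficiently different from every key already in memory, and the oldest key is evicted at capacity. This is a minimal illustration of the idea; the capacity, similarity threshold, and eviction policy below are assumptions for exposition, not the paper's exact criteria.

```python
from collections import deque

import numpy as np


class SmartMemory:
    """Novelty-gated key-frame memory (illustrative sketch).

    A frame embedding is admitted only when its cosine similarity to
    every stored key falls below a threshold, so redundant frames are
    discarded instead of growing the memory unboundedly.
    """

    def __init__(self, capacity=8, novelty_threshold=0.9):
        self.capacity = capacity          # assumed fixed memory budget
        self.threshold = novelty_threshold
        self.keys = deque()               # stored frame embeddings

    @staticmethod
    def _cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def maybe_store(self, embedding):
        """Store the embedding only if it is novel; return True if stored."""
        if any(self._cosine(embedding, k) >= self.threshold for k in self.keys):
            return False                  # redundant frame: skip it
        if len(self.keys) >= self.capacity:
            self.keys.popleft()           # evict the oldest key frame
        self.keys.append(embedding)
        return True
```

In a tracker loop, `maybe_store` would be called once per frame with the frame's feature embedding; only frames that change the object's appearance enough to clear the threshold consume a memory slot.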
Problem

Research questions and friction points this paper is trying to address.

Improving video object segmentation mask precision and boundary quality
Optimizing memory usage for long video sequence processing
Enhancing tracking robustness for deformable and topologically changing objects
Innovation

Methods, ideas, or system contributions that make the work stand out.

SAM-HQ for refined object boundaries
Dynamic smart memory for efficiency
Dynamic appearance model updates
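The dynamic appearance update above can be illustrated with a confidence-gated exponential moving average: the template is modified only when the current frame's segmentation is trusted, which limits drift from background confusion. The gate and momentum values here are assumptions standing in for the paper's distillation-feedback criterion, not its actual formulation.

```python
import numpy as np


def update_appearance(template, new_feat, confidence,
                      conf_gate=0.8, momentum=0.9):
    """Confidence-gated EMA update of an appearance template (sketch).

    Low-confidence frames leave the template untouched, so background
    pixels that leak into a poor mask cannot corrupt the model; the
    gate plays the role of the feedback check, with assumed values.
    """
    if confidence < conf_gate:
        return template  # distrust this frame: keep the old model
    return momentum * template + (1.0 - momentum) * new_feat
```

Applied per frame, this keeps the appearance model current for topologically changing objects while rejecting updates from occluded or ambiguous frames.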