🤖 AI Summary
Existing RefVOS methods rely on hand-crafted heuristic sampling or external keyframe models, and struggle to balance temporal modeling accuracy with architectural simplicity. This paper proposes an end-to-end trainable, LLM-driven framework that addresses this limitation through three core innovations: (1) a moment-centric sampling strategy that explicitly models temporal alignment between language expressions and video segments; (2) a bidirectional anchor-update propagation mechanism enabling precise key-segment localization and motion-detail preservation without external models; and (3) a unified temporal similarity matching scheme that leverages [FIND] tokens, combined with dense-sparse hybrid sampling and dynamic anchor optimization. Evaluated on multiple benchmarks, the approach achieves significant improvements in segmentation accuracy (mAP ↑3.2%) and temporal localization (tIoU@0.5 ↑5.7%), while maintaining high inference efficiency and robustness.
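The unified temporal similarity matching scheme mentioned above can be illustrated with a minimal sketch: compare a [FIND] token embedding against per-frame temporal token embeddings and pick the best match. All names and shapes here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def locate_key_moment(find_token: np.ndarray, frame_tokens: np.ndarray) -> int:
    """Hypothetical sketch of [FIND]-token temporal similarity matching.

    find_token:   (d,) embedding of the dedicated [FIND] token.
    frame_tokens: (T, d) temporal token embeddings, one per sampled frame.
    Returns the index of the frame most similar to the [FIND] token.
    """
    # Cosine similarity between the [FIND] token and each frame token
    find_n = find_token / np.linalg.norm(find_token)
    frames_n = frame_tokens / np.linalg.norm(frame_tokens, axis=1, keepdims=True)
    sims = frames_n @ find_n  # (T,)
    # The most similar frame is treated as the grounded key moment
    return int(np.argmax(sims))
```

In the actual framework the similarity would be computed inside the LLM's token space; this sketch only shows the matching step in isolation.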
📝 Abstract
Referring Video Object Segmentation (RefVOS) seeks to segment target objects in videos guided by natural language descriptions, demanding both temporal reasoning and fine-grained visual comprehension. Existing sampling strategies for LLM-based approaches typically rely on either handcrafted heuristics or external keyframe models. The former often overlooks essential temporal cues, while the latter increases system complexity. To address this, we propose a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and RefVOS, naturally incorporating key moment grounding capability. During training, we introduce a novel TSG paradigm that employs a dedicated [FIND] token for key moment identification through temporal token similarity matching, thereby avoiding the need for external timestamp encodings. For inference, we design a Moment-Centric Sampling (MCS) strategy that densely samples informative moments while sparsely sampling non-essential frames, preserving both motion details and global context. To further enhance tracking stability, we develop Bidirectional Anchor-updated Propagation (BAP), which leverages the most relevant moment as the starting point for high-quality mask initialization and dynamically updates the anchor at sampled points to mitigate accumulated errors. Code and model will be available at: https://github.com/Dmmm1997/MomentSeg
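The Moment-Centric Sampling idea (dense frames inside the grounded moment, sparse frames elsewhere) can be sketched as follows. The function name and stride values are illustrative assumptions; the paper does not specify these details in the abstract.

```python
def moment_centric_sampling(num_frames: int, moment: tuple,
                            dense_stride: int = 1, sparse_stride: int = 8) -> list:
    """Hypothetical MCS sketch (all parameters are illustrative).

    num_frames: total number of frames T in the video.
    moment:     (start, end) frame indices of the grounded key moment, inclusive.
    Samples every `dense_stride`-th frame inside the moment and every
    `sparse_stride`-th frame outside it, preserving motion detail near the
    moment while keeping global context cheap.
    """
    start, end = moment
    indices = []
    for t in range(num_frames):
        if start <= t <= end:
            if (t - start) % dense_stride == 0:
                indices.append(t)  # dense sampling inside the key moment
        elif t % sparse_stride == 0:
            indices.append(t)      # sparse sampling for global context
    return indices
```

For example, with a 32-frame clip and a grounded moment at frames 10-14, the sampler keeps every frame of the moment plus a coarse scaffold of context frames.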