🤖 AI Summary
Existing interactive video object segmentation (iVOS) methods rely on a single input mode and inefficient propagation mechanisms, which prevents them from supporting expressive user intent, real-time responsiveness, and multi-object handling at the same time. To address these limitations, we propose an ID-queried concurrent propagation framework (IDPro) with two key components: an Across-Frame Interaction (AFI) module that transfers scribble information across the frames a user annotates, and a truncated re-propagation strategy that retains interaction information between rounds in an across-round memory. Multiple objects are handled through an ID-queried mechanism. The framework is instantiated with Swin Transformer and ResNet backbones, achieving a favorable trade-off between lightweight design and performance. On DAVIS 2017, SwinB-IDPro achieves a new state-of-the-art J&F@60 score of 89.6%; R50-IDPro runs over 3× faster than the leading competitor in multi-object scenarios while remaining accurate and memory-efficient.
📝 Abstract
Interactive Video Object Segmentation (iVOS) is inherently demanding, requiring real-time interaction between humans and computers. Enhancing user experience involves considerations such as user input habits, segmentation quality, running time, and memory consumption. However, existing methods compromise user experience by supporting only a single input mode and running slowly. Specifically, these approaches restrict user interaction to a single frame, limiting the expression of user intent. To overcome these limitations and better align with user habits, we introduce a framework that supports flexible input modes through ID-queried concurrent propagation (IDPro). In particular, we devise the Across-Frame Interaction (AFI) module, which allows users to freely annotate different objects across multiple frames. The AFI module transfers scribble information across the interactive frames and generates multi-frame masks. Additionally, we leverage an ID-queried mechanism to process multiple objects. To achieve more efficient propagation and a lightweight model, we propose a truncated re-propagation strategy that replaces the previous multi-round fusion module and employs an across-round memory to store crucial interaction information. Our SwinB-IDPro attains new state-of-the-art performance on DAVIS 2017 (89.6%, $\mathcal{J}\&\mathcal{F}\text{@}60$). Furthermore, our R50-IDPro runs over $3\times$ faster than the leading competitor in challenging multi-object scenarios.