Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation

📅 2025-07-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Video instance segmentation (VIS) suffers from limited robustness under occlusion, motion blur, and appearance variations. This paper presents a systematic investigation of monocular depth estimation as a geometric cue to enhance VIS performance, proposing three geometry-aware fusion paradigms: Expanding Depth Channel (EDC), Sharing ViT (SV), and Depth Supervision (DS). These methods improve temporal association and representation learning via multimodal feature alignment, operating seamlessly on transformer-based backbones (e.g., Swin-L) without requiring additional annotations. Evaluated on the OVIS benchmark, EDC achieves 56.2 AP, setting a new state-of-the-art at the time and significantly outperforming prior approaches. The results empirically validate the effectiveness and generalizability of geometric priors in complex video scenes, advancing VIS toward multimodal, geometry-aware modeling.

📝 Abstract
Video Instance Segmentation (VIS) fundamentally struggles with pervasive challenges including object occlusions, motion blur, and appearance variations during temporal association. To overcome these limitations, this work introduces geometric awareness to enhance VIS robustness by strategically leveraging monocular depth estimation. We systematically investigate three distinct integration paradigms. The Expanding Depth Channel (EDC) method concatenates the depth map to the segmentation network's input as an additional channel; Sharing ViT (SV) employs a single ViT backbone shared between the depth estimation and segmentation branches; Depth Supervision (DS) uses depth prediction as an auxiliary training signal for feature learning. Although DS exhibits limited effectiveness, benchmark evaluations demonstrate that EDC and SV significantly enhance the robustness of VIS. With a Swin-L backbone, our EDC method achieves 56.2 AP, setting a new state-of-the-art result on the OVIS benchmark. This work conclusively establishes depth cues as critical enablers for robust video understanding.
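The core of the EDC paradigm described above is simply expanding the network input from 3 to 4 channels by stacking the predicted depth map onto the RGB frame. A minimal sketch of that channel expansion (shapes and names are illustrative, not taken from the paper's code):

```python
import numpy as np

# Illustrative H x W x 3 video frame and H x W x 1 monocular depth map.
rgb_frame = np.random.rand(480, 640, 3)
depth_map = np.random.rand(480, 640, 1)

# EDC idea: concatenate depth as a fourth input channel (RGB-D),
# so the segmentation network receives geometry alongside appearance.
rgbd_input = np.concatenate([rgb_frame, depth_map], axis=-1)
print(rgbd_input.shape)  # (480, 640, 4)
```

In practice this only requires widening the first convolution (or patch-embedding) layer of the segmentation backbone to accept 4 input channels.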
Problem

Research questions and friction points this paper is trying to address.

Enhance Video Instance Segmentation robustness using geometric cues
Address object occlusions, motion blur, and appearance variations in VIS
Integrate monocular depth estimation to improve temporal association
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages monocular depth estimation for robustness
Expands depth channel in segmentation networks
Shares ViT backbone between depth and segmentation
Quanzhu Niu
Wuhan University
computer vision, MLLM
Yikang Zhou
Wuhan University, China
Shihao Chen
Wuhan University, China
Tao Zhang
Wuhan University, China
Shunping Ji
Wuhan University, China