Scene-Centric Unsupervised Video Panoptic Segmentation

2026-06-03
Citations: 0
Influential: 0

AI Summary
This work introduces the task setting of unsupervised video panoptic segmentation (VPS) and proposes VideoCUPS, the first unsupervised VPS approach, which detects, segments, and tracks all objects in videos without any human annotations. Exploiting unsupervised depth, motion, and visual cues from scene-centric videos, VideoCUPS generates temporally consistent panoptic pseudo-labels and trains on them with a novel Video DropLoss. The study also establishes a comprehensive evaluation protocol and four baselines that extend state-of-the-art unsupervised panoptic image and instance video segmentation models to VPS. VideoCUPS outperforms all four baselines and demonstrates strong label-efficient learning.
Abstract
Video panoptic segmentation (VPS) aims to jointly detect, segment, and track all objects while partitioning the video into semantically consistent regions. We introduce the task setting of unsupervised VPS, omitting any human supervision. Existing unsupervised scene understanding works mainly focused on image segmentation tasks; the video domain remains underexplored. We propose VideoCUPS, the first unsupervised VPS approach. VideoCUPS generates temporally consistent panoptic video pseudo-labels from scene-centric videos by exploiting unsupervised depth, motion, and visual cues. Training on these pseudo-labels using a novel Video DropLoss yields an accurate, unsupervised VPS model. To benchmark progress, we introduce a comprehensive evaluation protocol and four competitive baselines, extending state-of-the-art unsupervised panoptic image and instance video segmentation models to VPS. VideoCUPS outperforms all baselines and demonstrates strong label-efficient learning. With VideoCUPS, our evaluation protocol, and baselines, we provide a strong foundation for future research on unsupervised VPS.
Problem

Research questions and friction points this paper is trying to address.

unsupervised video panoptic segmentation
scene-centric videos
temporal consistency
pseudo-labels
video segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

unsupervised video panoptic segmentation
VideoCUPS
pseudo-labeling
temporal consistency
Video DropLoss