MVP: Motion Vector Propagation for Zero-Shot Video Object Detection

📅 2025-09-22

📈 Citations: 0

✨ Influential: 0

career value

221K/year

🤖 AI Summary

To address the excessive computational cost of frame-wise open-vocabulary detection in zero-shot video object detection, this paper proposes a lightweight, training-free, annotation-free, and fine-tuning-free propagation method. Our approach applies the OWLv2 detector only on sparse keyframes and propagates bounding boxes across frames using motion vectors extracted from the video compression domain. To enhance propagation robustness, we introduce three novel components: (i) 3×3 grid-based detection aggregation, (ii) region-growing validation, and (iii) single-class switching. A dynamic IoU threshold controls localization precision adaptively, enabling support for arbitrary open-vocabulary prompts. Evaluated on the ILSVRC2015-VID validation set, our method achieves mAP@0.5 = 0.609—matching near-full-frame detection performance while significantly outperforming tracker-based propagation baselines. To the best of our knowledge, this is the first fully unsupervised, open-vocabulary zero-shot video object detection framework.

Technology Category

Application Category

📝 Abstract

Running a large open-vocabulary (Open-vocab) detector on every video frame is accurate but expensive. We introduce a training-free pipeline that invokes OWLv2 only on fixed-interval keyframes and propagates detections to intermediate frames using compressed-domain motion vectors (MV). A simple 3x3 grid aggregation of motion vectors provides translation and uniform-scale updates, augmented with an area-growth check and an optional single-class switch. The method requires no labels, no fine-tuning, and uses the same prompt list for all open-vocabulary methods. On ILSVRC2015-VID (validation dataset), our approach (MVP) attains mAP@0.5=0.609 and mAP@[0.5:0.95]=0.316. At loose intersection-over-union (IoU) thresholds it remains close to framewise OWLv2-Large (0.747/0.721 at 0.2/0.3 versus 0.784/0.780), reflecting that coarse localization is largely preserved. Under the same keyframe schedule, MVP outperforms tracker-based propagation (MOSSE, KCF, CSRT) at mAP@0.5. A supervised reference (YOLOv12x) reaches 0.631 at mAP@0.5 but requires labeled training, whereas our method remains label-free and open-vocabulary. These results indicate that compressed-domain propagation is a practical way to reduce detector invocations while keeping strong zero-shot coverage in videos. Our code and models are available at https://github.com/microa/MVP.

Problem

Research questions and friction points this paper is trying to address.

Reducing computational cost of running large detectors on every video frame

Propagating object detections between frames without training or labels

Maintaining zero-shot open-vocabulary detection accuracy in videos

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses motion vectors for detection propagation

Applies grid aggregation for translation updates

Requires no training or labeled data

🔎 Similar Papers

No similar papers found.