π€ AI Summary
To address the excessive computational cost of frame-wise open-vocabulary detection in zero-shot video object detection, this paper proposes a lightweight, training-free, annotation-free, and fine-tuning-free propagation method. Our approach applies the OWLv2 detector only on sparse keyframes and propagates bounding boxes across frames using motion vectors extracted from the video compression domain. To enhance propagation robustness, we introduce three novel components: (i) 3Γ3 grid-based detection aggregation, (ii) region-growing validation, and (iii) single-class switching. A dynamic IoU threshold controls localization precision adaptively, enabling support for arbitrary open-vocabulary prompts. Evaluated on the ILSVRC2015-VID validation set, our method achieves mAP@0.5 = 0.609βmatching near-full-frame detection performance while significantly outperforming tracker-based propagation baselines. To the best of our knowledge, this is the first fully unsupervised, open-vocabulary zero-shot video object detection framework.
π Abstract
Running a large open-vocabulary (Open-vocab) detector on every video frame is accurate but expensive. We introduce a training-free pipeline that invokes OWLv2 only on fixed-interval keyframes and propagates detections to intermediate frames using compressed-domain motion vectors (MV). A simple 3x3 grid aggregation of motion vectors provides translation and uniform-scale updates, augmented with an area-growth check and an optional single-class switch. The method requires no labels, no fine-tuning, and uses the same prompt list for all open-vocabulary methods. On ILSVRC2015-VID (validation dataset), our approach (MVP) attains mAP@0.5=0.609 and mAP@[0.5:0.95]=0.316. At loose intersection-over-union (IoU) thresholds it remains close to framewise OWLv2-Large (0.747/0.721 at 0.2/0.3 versus 0.784/0.780), reflecting that coarse localization is largely preserved. Under the same keyframe schedule, MVP outperforms tracker-based propagation (MOSSE, KCF, CSRT) at mAP@0.5. A supervised reference (YOLOv12x) reaches 0.631 at mAP@0.5 but requires labeled training, whereas our method remains label-free and open-vocabulary. These results indicate that compressed-domain propagation is a practical way to reduce detector invocations while keeping strong zero-shot coverage in videos. Our code and models are available at https://github.com/microa/MVP.