Match4Annotate: Propagating Sparse Video Annotations via Implicit Neural Feature Matching

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high cost of frame-by-frame annotation in medical imaging and the limitations of existing methods, which struggle to simultaneously handle cross-video and intra-video propagation, support both point and mask annotations, and ensure spatiotemporal smoothness. To this end, we propose a lightweight test-time optimization framework that fits a SIREN-based implicit neural representation to DINOv3 features, constructing a continuous high-resolution spatiotemporal feature field. A smooth inter-frame implicit deformation field is introduced to guide accurate correspondence matching. Our approach is the first to unify support for both point and mask annotations across frames and videos within a single framework. Evaluated on three clinical ultrasound datasets, the method achieves state-of-the-art performance in cross-video propagation, significantly outperforming feature-matching and one-shot segmentation baselines, while attaining intra-video propagation accuracy comparable to specialized trackers.

📝 Abstract
Acquiring per-frame video annotations remains a primary bottleneck for deploying computer vision in specialized domains such as medical imaging, where expert labeling is slow and costly. Label propagation offers a natural solution, yet existing approaches face fundamental limitations. Video trackers and segmentation models can propagate labels within a single sequence but require per-video initialization and cannot generalize across videos. Classic correspondence pipelines operate on detector-chosen keypoints and struggle in low-texture scenes, while dense feature matching and one-shot segmentation methods enable cross-video propagation but lack spatiotemporal smoothness and unified support for both point and mask annotations. We present Match4Annotate, a lightweight framework for both intra-video and inter-video propagation of point and mask annotations. Our method fits a SIREN-based implicit neural representation to DINOv3 features at test time, producing a continuous, high-resolution spatiotemporal feature field, and learns a smooth implicit deformation field between frame pairs to guide correspondence matching. We evaluate on three challenging clinical ultrasound datasets. Match4Annotate achieves state-of-the-art inter-video propagation, outperforming feature matching and one-shot segmentation baselines, while remaining competitive with specialized trackers for intra-video propagation. Our results show that lightweight, test-time-optimized feature matching pipelines have the potential to offer an efficient and accessible solution for scalable annotation workflows.
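The abstract describes fitting a SIREN-based implicit neural representation to backbone features at test time, yielding a continuous spatiotemporal feature field. As a minimal illustrative sketch (not the authors' implementation — the architecture sizes, coordinate convention, and training loop below are assumptions), such a field can be modeled as a sine-activated MLP mapping (x, y, t) coordinates to a feature vector and regressed onto precomputed features:

```python
# Hedged sketch: a SIREN-style MLP fit to a spatiotemporal feature field
# f(x, y, t) -> R^D, in the spirit of the paper's test-time optimization.
# All names and hyperparameters are illustrative, not the authors' code.
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """Linear layer with sin activation and SIREN-style initialization."""
    def __init__(self, in_dim, out_dim, w0=30.0, first=False):
        super().__init__()
        self.w0 = w0
        self.linear = nn.Linear(in_dim, out_dim)
        # SIREN init: wider bound for the first layer, scaled by w0 after.
        bound = 1.0 / in_dim if first else (6.0 / in_dim) ** 0.5 / w0
        nn.init.uniform_(self.linear.weight, -bound, bound)

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))

class FeatureField(nn.Module):
    """Maps (x, y, t) coordinates in [-1, 1]^3 to a D-dim feature vector."""
    def __init__(self, feat_dim=32, hidden=256, depth=3):
        super().__init__()
        layers = [SineLayer(3, hidden, first=True)]
        layers += [SineLayer(hidden, hidden) for _ in range(depth - 1)]
        self.net = nn.Sequential(*layers, nn.Linear(hidden, feat_dim))

    def forward(self, coords):
        return self.net(coords)

# Test-time fitting loop (illustrative): regress the continuous field onto
# sampled backbone features at known coordinates. Here `target` is random
# data standing in for DINOv3 features.
field = FeatureField(feat_dim=32)
coords = torch.rand(1024, 3) * 2 - 1   # (x, y, t) samples in [-1, 1]
target = torch.randn(1024, 32)          # stand-in for backbone features
opt = torch.optim.Adam(field.parameters(), lr=1e-4)
for _ in range(5):                      # a few demo optimization steps
    loss = ((field(coords) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the field is continuous in (x, y, t), it can be queried at arbitrary sub-pixel locations and intermediate times, which is what enables the high-resolution correspondence matching the abstract describes.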
Problem

Research questions and friction points this paper is trying to address.

video annotation propagation
cross-video generalization
spatiotemporal smoothness
point and mask annotations
medical imaging
Innovation

Methods, ideas, or system contributions that make the work stand out.

implicit neural representation
feature matching
annotation propagation
SIREN
DINOv3
Zhuorui Zhang
Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, USA
Roger Pallarès-López
Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, USA
Praneeth Namburi
MIT.nano Immersion Lab, Massachusetts Institute of Technology, Cambridge, USA
Neuroscience · Biomechanics
Brian W. Anthony
Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, USA; MIT.nano Immersion Lab, Massachusetts Institute of Technology, Cambridge, USA