Match4Annotate: Propagating Sparse Video Annotations via Implicit Neural Feature Matching

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high cost of frame-by-frame annotation in medical imaging and the limitations of existing methods, which struggle to simultaneously handle cross-video and intra-video propagation, support both point and mask annotations, and ensure spatiotemporal smoothness. To this end, we propose a lightweight test-time optimization framework that fits a SIREN-based implicit neural representation to DINOv3 features, constructing a continuous high-resolution spatiotemporal feature field. A smooth inter-frame implicit deformation field is introduced to guide accurate correspondence matching. Our approach is the first to unify support for both point and mask annotations across frames and videos within a single framework. Evaluated on three clinical ultrasound datasets, the method achieves state-of-the-art performance in cross-video propagation, significantly outperforming feature-matching and one-shot segmentation baselines, while attaining intra-video propagation accuracy comparable to specialized trackers.

📝 Abstract
Acquiring per-frame video annotations remains a primary bottleneck for deploying computer vision in specialized domains such as medical imaging, where expert labeling is slow and costly. Label propagation offers a natural solution, yet existing approaches face fundamental limitations. Video trackers and segmentation models can propagate labels within a single sequence but require per-video initialization and cannot generalize across videos. Classic correspondence pipelines operate on detector-chosen keypoints and struggle in low-texture scenes, while dense feature matching and one-shot segmentation methods enable cross-video propagation but lack spatiotemporal smoothness and unified support for both point and mask annotations. We present Match4Annotate, a lightweight framework for both intra-video and inter-video propagation of point and mask annotations. Our method fits a SIREN-based implicit neural representation to DINOv3 features at test time, producing a continuous, high-resolution spatiotemporal feature field, and learns a smooth implicit deformation field between frame pairs to guide correspondence matching. We evaluate on three challenging clinical ultrasound datasets. Match4Annotate achieves state-of-the-art inter-video propagation, outperforming feature matching and one-shot segmentation baselines, while remaining competitive with specialized trackers for intra-video propagation. Our results show that lightweight, test-time-optimized feature matching pipelines have the potential to offer an efficient and accessible solution for scalable annotation workflows.
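The abstract describes fitting a SIREN-based implicit neural representation to backbone features at test time, yielding a continuous spatiotemporal feature field. As a minimal illustrative sketch (not the authors' implementation — the architecture sizes, coordinate convention, and training loop below are assumptions), such a field can be modeled as a sine-activated MLP mapping (x, y, t) coordinates to a feature vector and regressed onto precomputed features:

```python
# Hedged sketch: a SIREN-style MLP fit to a spatiotemporal feature field
# f(x, y, t) -> R^D, in the spirit of the paper's test-time optimization.
# All names and hyperparameters are illustrative, not the authors' code.
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """Linear layer with sin activation and SIREN-style initialization."""
    def __init__(self, in_dim, out_dim, w0=30.0, first=False):
        super().__init__()
        self.w0 = w0
        self.linear = nn.Linear(in_dim, out_dim)
        # SIREN init: wider bound for the first layer, scaled by w0 after.
        bound = 1.0 / in_dim if first else (6.0 / in_dim) ** 0.5 / w0
        nn.init.uniform_(self.linear.weight, -bound, bound)

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))

class FeatureField(nn.Module):
    """Maps (x, y, t) coordinates in [-1, 1]^3 to a D-dim feature vector."""
    def __init__(self, feat_dim=32, hidden=256, depth=3):
        super().__init__()
        layers = [SineLayer(3, hidden, first=True)]
        layers += [SineLayer(hidden, hidden) for _ in range(depth - 1)]
        self.net = nn.Sequential(*layers, nn.Linear(hidden, feat_dim))

    def forward(self, coords):
        return self.net(coords)

# Test-time fitting loop (illustrative): regress the continuous field onto
# sampled backbone features at known coordinates. Here `target` is random
# data standing in for DINOv3 features.
field = FeatureField(feat_dim=32)
coords = torch.rand(1024, 3) * 2 - 1   # (x, y, t) samples in [-1, 1]
target = torch.randn(1024, 32)          # stand-in for backbone features
opt = torch.optim.Adam(field.parameters(), lr=1e-4)
for _ in range(5):                      # a few demo optimization steps
    loss = ((field(coords) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the field is continuous in (x, y, t), it can be queried at arbitrary sub-pixel locations and intermediate times, which is what enables the high-resolution correspondence matching the abstract describes.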
Problem

Research questions and friction points this paper is trying to address.

video annotation propagation
cross-video generalization
spatiotemporal smoothness
point and mask annotations
medical imaging
Innovation

Methods, ideas, or system contributions that make the work stand out.

implicit neural representation
feature matching
annotation propagation
SIREN
DINOv3
Zhuorui Zhang
Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, USA
Roger Pallarès-López
Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, USA
Praneeth Namburi
MIT.nano Immersion Lab, Massachusetts Institute of Technology, Cambridge, USA
Neuroscience · Biomechanics
Brian W. Anthony
Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, USA; MIT.nano Immersion Lab, Massachusetts Institute of Technology, Cambridge, USA