🤖 AI Summary
Existing active learning methods for multi-object tracking select data at the frame level, which is misaligned with the multi-frame, clip-based training of end-to-end trackers and leads to suboptimal annotation efficiency. This work proposes CUTAL, the first clip-level active learning framework tailored to end-to-end multi-object tracking. It scores the informativeness of multi-frame clips by modeling cross-frame association uncertainty and applies a temporal diversity constraint to reduce redundancy. Because it matches the training requirements of modern end-to-end trackers more closely, the approach outperforms existing baselines at the same label budgets when training both MeMOTR and SambaMOTR, and for MeMOTR it reaches performance comparable to full supervision with only 50% of the labeled training data.
📝 Abstract
Multi-Object Tracking (MOT) in dynamic environments relies on robust temporal reasoning to maintain consistent object identities over time. Transformer-based end-to-end MOT models achieve strong performance by explicitly modeling temporal dependencies, yet training them requires extensive bounding-box and identity annotations. Given the high labeling cost and strong redundancy in videos, Active Learning (AL) is an effective approach to improve annotation efficiency. However, existing AL methods for MOT primarily operate at the frame level, which is structurally misaligned with modern end-to-end trackers whose inference and training rely on multi-frame clips. To bridge this gap, we formulate clip-level active learning and propose Clip-level Uncertainty and Temporal-aware Active Learning (CUTAL). In contrast to frame-based approaches, CUTAL scores each clip using uncertainty metrics derived from multi-frame predictions to capture inter-frame correspondence ambiguities, while enforcing temporal diversity to select an informative and non-redundant subset. Experiments show that CUTAL achieves stronger overall performance than baselines at the same label budgets across MeMOTR and SambaMOTR. Notably, CUTAL achieves performance comparable to full supervision for MeMOTR on both datasets using only 50% of the labeled training data.
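The clip-level selection idea described in the abstract can be illustrated with a minimal sketch. This is not the CUTAL algorithm itself, whose exact uncertainty metrics and diversity constraint are not given here. As assumptions for illustration only, the sketch uses the mean entropy of per-track association distributions as the clip score and a greedy minimum-frame-gap rule as the temporal diversity constraint. The names `Clip`, `clip_uncertainty`, `too_close`, `select_clips`, and `min_gap` are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Clip:
    video_id: str
    start: int   # index of the first frame of the clip
    length: int  # number of frames in the clip
    # assoc[t][k] is a probability distribution over candidate matches for
    # track k between consecutive frames (a hypothetical model output).
    assoc: Sequence[Sequence[Sequence[float]]]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def clip_uncertainty(clip: Clip) -> float:
    """Assumed clip score: mean association entropy over all frames and tracks."""
    ents = [entropy(dist) for frame in clip.assoc for dist in frame]
    return sum(ents) / len(ents) if ents else 0.0


def too_close(a: Clip, b: Clip, min_gap: int) -> bool:
    """True if two clips of the same video overlap or lie within min_gap frames."""
    if a.video_id != b.video_id:
        return False
    # Gap between the half-open intervals [start, start + length);
    # it is negative when the clips overlap.
    gap = max(a.start, b.start) - min(a.start + a.length, b.start + b.length)
    return gap < min_gap


def select_clips(clips: List[Clip], budget: int, min_gap: int) -> List[Clip]:
    """Greedily pick the most uncertain clips, skipping temporally redundant ones."""
    ranked = sorted(clips, key=clip_uncertainty, reverse=True)
    chosen: List[Clip] = []
    for clip in ranked:
        if len(chosen) >= budget:
            break
        if all(not too_close(clip, picked, min_gap) for picked in chosen):
            chosen.append(clip)
    return chosen
```

In this sketch, a clip that overlaps an already selected, more uncertain clip from the same video is skipped even if its own score is high, which is one simple way to trade informativeness against redundancy under a fixed label budget.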