🤖 AI Summary
To address label noise in object detection—primarily arising from annotator subjectivity, especially under ambiguous boundaries or occlusion—this paper proposes a lightweight and robust training paradigm. Methodologically, it first leverages foundation models (e.g., SAM or CLIP) to pre-correct raw bounding-box annotations, thereby mitigating boundary uncertainty. It then integrates a multiple-instance learning (MIL) framework to relax reliance on precise box-level supervision. Finally, only the detection head is fine-tuned, enabling efficient adaptation in few-shot settings. The approach achieves state-of-the-art performance on both standard and few-shot detection benchmarks, significantly outperforming existing label-noise mitigation methods. Its key contributions include: (i) the first use of foundation models for annotation pre-correction in detection; (ii) synergistic coupling of MIL with foundation-model-guided refinement; and (iii) high efficiency, architectural simplicity, and strong generalization—particularly under data-scarce conditions.
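The annotation pre-correction step can be sketched as follows. This is a minimal illustration, assuming the foundation model (e.g., SAM prompted with the noisy box) returns a binary segmentation mask, from which a tight box replaces the noisy annotation; the function name and fallback behavior here are hypothetical, not taken from the paper:

```python
def refine_box_from_mask(mask):
    """Replace a noisy bounding box with the tight box around the
    foreground pixels of a binary segmentation mask (e.g., one
    produced by SAM when prompted with the noisy box).
    `mask` is a 2-D list of 0/1 values; returns (x1, y1, x2, y2)."""
    coords = [(x, y) for y, row in enumerate(mask)
                     for x, v in enumerate(row) if v]
    if not coords:
        return None  # empty mask: fall back to the original annotation
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))

# Toy example: a 10x10 mask with a foreground blob at rows 2-4, cols 3-6.
mask = [[0] * 10 for _ in range(10)]
for y in range(2, 5):
    for x in range(3, 7):
        mask[y][x] = 1
print(refine_box_from_mask(mask))  # (3, 2, 6, 4)
```

In practice the refined box would be written back into the dataset before training, so the detector never sees the original noisy coordinates.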
📝 Abstract
Collecting high-quality data for object detection tasks is challenging due to the inherent subjectivity in labeling the boundaries of an object. This makes it difficult not only to collect consistent annotations across a dataset but also to validate them, as no two annotators are likely to label the same object using the exact same coordinates. These challenges are further compounded when object boundaries are partially visible or blurred, which can be the case in many domains. Training on noisy annotations significantly degrades detector performance, rendering detectors unusable, particularly in few-shot settings, where even a few corrupted annotations can impact model performance. In this work, we propose FMG-Det, a simple, efficient methodology for training models with noisy annotations. More specifically, we propose combining a multiple instance learning (MIL) framework with a pre-processing pipeline that leverages powerful foundation models to correct labels prior to training. This pre-processing pipeline, along with slight modifications to the detector head, results in state-of-the-art performance across a number of datasets, for both standard and few-shot scenarios, while being much simpler and more efficient than other approaches.
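The MIL relaxation described above can be sketched with a simple max-pooling bag loss: instead of supervising a single (possibly noisy) box, a bag of candidate boxes is formed and only the highest-scoring instance drives the loss. This is a generic illustration of the MIL idea under the assumption of max pooling over the bag; the paper's actual detector-head modifications may differ:

```python
import math

def mil_nll(bag_scores):
    """Negative log-likelihood for one positive bag of candidate boxes.
    A bag is positive if at least one candidate matches the object, so
    supervision flows only through the best instance (max pooling),
    relaxing reliance on any single precise box annotation."""
    return -math.log(max(bag_scores))

# Objectness scores for candidates jittered around a noisy annotation:
# the loss rewards the best candidate rather than trusting the raw box.
print(round(mil_nll([0.2, 0.9, 0.4]), 4))  # -log(0.9) ~= 0.1054
```

Since the gradient only flows through the argmax instance, a badly placed annotation contributes little to training as long as some candidate in its bag covers the true object.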