🤖 AI Summary
Existing object detectors often learn task-driven features that rely on shortcut correlations and fail to capture the underlying annotation structure, which limits their generalization, interpretability, and robustness under task shifts or sparse supervision. To address this, the work proposes an annotation-guided feature enhancement framework that, for the first time, explicitly integrates geometric annotation priors into feature learning. The method constructs a dense spatial feature grid and injects it into the backbone network, where it fuses with the feature pyramid, steering the region proposal and detection heads toward representations better aligned with annotation structure. Evaluated on wildlife and remote sensing datasets, the approach significantly improves object focus, reduces background sensitivity, and demonstrates superior generalization and data efficiency in weakly supervised and unseen-task settings.
📝 Abstract
Task-driven features learned by modern object detectors optimize the end-task loss yet often capture shortcut correlations that fail to reflect the underlying annotation structure. Such representations limit transfer, interpretability, and robustness when task definitions change or supervision becomes sparse. This paper introduces an annotation-guided feature augmentation framework that injects annotation-derived embeddings into an object detection backbone. The method constructs dense spatial feature grids from annotation-guided latent spaces and fuses them with feature pyramid representations to influence the region proposal and detection heads. Experiments across wildlife and remote sensing datasets evaluate classification, localization, and data efficiency under multiple supervision regimes. Results show consistent improvements in object focus, reduced background sensitivity, and stronger generalization to unseen or weakly supervised tasks. The findings demonstrate that aligning features with annotation geometry yields more meaningful representations than purely task-optimized ones.
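As a rough illustration of the fusion step described in the abstract, the sketch below shows one plausible way to combine an annotation-derived spatial grid with a single feature pyramid level in PyTorch. All names here (`AnnotationGridFusion`, the 1x1-convolution fusion, the bilinear resize) are assumptions for illustration; the paper does not specify its exact grid construction or fusion operator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AnnotationGridFusion(nn.Module):
    """Hypothetical fusion of an annotation-derived grid with one FPN level.

    A minimal sketch, assuming the grid is a dense (B, D, Hg, Wg) tensor
    sampled from an annotation-guided latent space; the paper's actual
    module layout may differ.
    """

    def __init__(self, fpn_channels: int, annot_dim: int):
        super().__init__()
        # Project annotation embeddings to the FPN channel width.
        self.proj = nn.Conv2d(annot_dim, fpn_channels, kernel_size=1)
        # Mix the concatenated features back down to fpn_channels.
        self.fuse = nn.Conv2d(2 * fpn_channels, fpn_channels, kernel_size=1)

    def forward(self, fpn_feat: torch.Tensor, annot_grid: torch.Tensor) -> torch.Tensor:
        # fpn_feat:   (B, C, H, W)   one feature-pyramid level
        # annot_grid: (B, D, Hg, Wg) dense annotation-derived grid
        grid = self.proj(annot_grid)
        # Resize the grid to this pyramid level's spatial resolution.
        grid = F.interpolate(
            grid, size=fpn_feat.shape[-2:], mode="bilinear", align_corners=False
        )
        # Concatenate and mix; the fused map would then feed the
        # region proposal and detection heads.
        return self.fuse(torch.cat([fpn_feat, grid], dim=1))


# Toy usage: a 256-channel FPN level at 64x64, a 64-dim grid at 32x32.
if __name__ == "__main__":
    fusion = AnnotationGridFusion(fpn_channels=256, annot_dim=64)
    feat = torch.randn(2, 256, 64, 64)
    grid = torch.randn(2, 64, 32, 32)
    out = fusion(feat, grid)
    print(out.shape)  # torch.Size([2, 256, 64, 64])
```

Concatenation followed by a 1x1 convolution is only one of several reasonable fusion choices (element-wise addition or attention-based gating would also fit the abstract's description); it is used here because it keeps the sketch short and shape-explicit.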