🤖 AI Summary
To address the reliance of aerial object detection on predefined categories, which limits its use in open-world scenarios, this paper proposes OVA-Det, an efficient open-vocabulary detector for aerial scenes. Methodologically, it replaces the conventional category regression loss with an image-to-text alignment loss; introduces a lightweight text-guided strategy that enhances feature extraction in the encoder and steers decoder queries toward class-relevant image features; and integrates image-text contrastive learning with multi-scale feature enhancement. Evaluated on three widely used benchmarks, including DIOR, the framework outperforms prior state-of-the-art methods: it reaches 37.2 mAP and 79.8 Recall in zero-shot detection on DIOR (12.4 and 42.0 higher than YOLO-World, respectively) while running in real time at 36 FPS on an RTX 4090. By integrating the text modality into the detection architecture, the work improves generalization to unseen categories in open-world settings.
📝 Abstract
Aerial object detection plays a crucial role in numerous applications. However, most existing methods focus on detecting predefined object categories, limiting their applicability in real-world open scenarios. In this paper, we extend aerial object detection to open scenarios through image-text collaboration and propose OVA-Det, a highly efficient open-vocabulary detector for aerial scenes. Specifically, we first introduce an image-to-text alignment loss to replace the conventional category regression loss, thereby eliminating category limitations. Next, we propose a lightweight text-guided strategy that enhances the feature extraction process in the encoder and enables queries to focus on class-relevant image features within the decoder, further improving detection accuracy without introducing significant additional cost. Extensive comparison experiments demonstrate that the proposed OVA-Det outperforms state-of-the-art methods by a large margin on all three widely used benchmark datasets. For instance, for zero-shot detection on DIOR, OVA-Det achieves 37.2 mAP and 79.8 Recall, which are 12.4 and 42.0 higher than those of YOLO-World, respectively. In addition, the inference speed of OVA-Det reaches 36 FPS on an RTX 4090, meeting the real-time detection requirements of various applications. The code is available at https://github.com/GT-Wei/OVA-Det.
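To illustrate why replacing a fixed-size category head with image-to-text alignment removes the category limit, here is a minimal sketch in plain Python. The abstract does not specify the form of the loss, so everything here is an assumption made for illustration: the cosine-similarity scoring, the temperature value, the binary cross-entropy form, the function names, and the toy two-dimensional embeddings. It is not the OVA-Det implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def alignment_logits(query_embs, text_embs, temperature=0.1):
    """Cosine similarity between every query and every category text embedding.

    The temperature is an assumed hyperparameter, not a value from the paper.
    """
    q = [l2_normalize(v) for v in query_embs]
    t = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
            for qi in q]


def alignment_loss(logits, targets):
    """Mean binary cross-entropy over all (query, category) pairs.

    targets[i][j] is 1 if query i is matched to category j, else 0.
    BCE is an assumed choice; the paper may use a different formulation.
    """
    total, count = 0.0, 0
    for row, target_row in zip(logits, targets):
        for z, y in zip(row, target_row):
            # Numerically stable BCE with logits.
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count


# Toy example: two decoder queries, three category prompts (hypothetical values).
query_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = alignment_logits(query_embs, text_embs)
loss = alignment_loss(logits, [[1, 0, 0], [0, 1, 0]])
```

Because the score matrix has one column per text embedding, handling a category unseen during training only requires appending its text embedding to `text_embs`; no output layer has to be resized or retrained in this sketch.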