OVA-Det: Open Vocabulary Aerial Object Detection with Image-Text Collaboration

📅 2024-08-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Most aerial object detectors are restricted to predefined categories, which limits their use in open-world scenarios. This paper proposes OVA-Det, an efficient open-vocabulary detector for aerial scenes built on image-text collaboration. Methodologically, it replaces the conventional category regression loss with an image-to-text alignment loss, and it adds a lightweight text-guided strategy that enhances feature extraction in the encoder and lets decoder queries focus on class-relevant image features. On three widely used benchmarks, OVA-Det outperforms prior state-of-the-art methods; for zero-shot detection on DIOR it reaches 37.2 mAP and 79.8 Recall while running at 36 FPS on an RTX 4090.

📝 Abstract

Aerial object detection plays a crucial role in numerous applications. However, most existing methods focus on detecting predefined object categories, limiting their applicability in real-world open scenarios. In this paper, we extend aerial object detection to open scenarios through image-text collaboration and propose OVA-Det, a highly efficient open-vocabulary detector for aerial scenes. Specifically, we first introduce an image-to-text alignment loss to replace the conventional category regression loss, thereby eliminating category limitations. Next, we propose a lightweight text-guided strategy that enhances the feature extraction process in the encoder and enables queries to focus on class-relevant image features within the decoder, further improving detection accuracy without introducing significant additional costs. Extensive comparison experiments demonstrate that the proposed OVA-Det outperforms state-of-the-art methods on all three widely used benchmark datasets by a large margin. For instance, for zero-shot detection on DIOR, OVA-Det achieves 37.2 mAP and 79.8 Recall, which are 12.4 and 42.0 higher than those of YOLO-World. In addition, the inference speed of OVA-Det reaches 36 FPS on an RTX 4090, meeting the real-time detection requirements of various applications. The code is available at https://github.com/GT-Wei/OVA-Det.
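The idea of replacing category regression with image-to-text alignment can be illustrated with a small sketch. The abstract does not give the exact form of the loss, so everything below is an assumption made for illustration: cosine similarity between region features and class-prompt embeddings, a temperature of 0.07, a softmax cross-entropy over the prompts, and the hypothetical names alignment_logits and alignment_loss. It is not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def alignment_logits(region_feats, text_embeds, temperature=0.07):
    """Similarity of every region feature to every class-prompt embedding.

    region_feats: (num_regions, dim), text_embeds: (num_classes, dim).
    The class set is defined only by the rows of text_embeds (assumption:
    they come from some text encoder), so it can change at inference time.
    """
    r = l2_normalize(region_feats)
    t = l2_normalize(text_embeds)
    return (r @ t.T) / temperature

def alignment_loss(region_feats, text_embeds, target_idx, temperature=0.07):
    """Cross-entropy over the similarity logits (illustrative choice of loss)."""
    logits = alignment_logits(region_feats, text_embeds, temperature)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(target_idx)), target_idx].mean())

# Toy data: three regions that lie close to prompts 2, 0 and 3.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 16))
regions = text[[2, 0, 3]] + 0.01 * rng.normal(size=(3, 16))
targets = np.array([2, 0, 3])

loss_matched = alignment_loss(regions, text, targets)
loss_mismatched = alignment_loss(regions, text, np.array([1, 1, 1]))

# Adding a prompt for a new category only adds a column of logits.
extended_text = np.vstack([text, rng.normal(size=(1, 16))])
extended_logits = alignment_logits(regions, extended_text)
```

Because the classifier is just a similarity against text embeddings, adding a category means adding a row to the prompt matrix rather than retraining a fixed-size output layer, which is the sense in which an alignment loss removes the category limitation.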

Problem

Research questions and friction points this paper is trying to address.

Extends aerial object detection to open scenarios.

Proposes OVA-Det for efficient open-vocabulary detection.

Improves detection accuracy with a lightweight text-guided strategy.
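One way to picture a text-guided strategy that makes queries focus on class-relevant image features is sketched below. The summary does not describe the actual mechanism, so this is only an assumed stand-in: each image token is scored by its best cosine similarity to any class prompt, and the top-k tokens are taken as decoder queries. The function names text_relevance and select_queries are hypothetical.

```python
import numpy as np

def text_relevance(image_feats, text_embeds):
    """Best cosine similarity of each image token to any class prompt."""
    f = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return (f @ t.T).max(axis=1)

def select_queries(image_feats, text_embeds, k):
    """Pick the k image tokens most relevant to the prompts (assumed scheme)."""
    scores = text_relevance(image_feats, text_embeds)
    # A stable sort keeps the original token order among equal scores.
    top = np.argsort(-scores, kind="stable")[:k]
    return top, image_feats[top]

# Two class prompts and six image tokens in a 4-dimensional toy space.
text = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0]])
tokens = np.array([[0.0, 0.0, 1.0, 0.0],
                   [2.0, 0.0, 0.0, 0.0],   # aligned with prompt 0
                   [0.0, 0.0, 0.0, 1.0],
                   [0.0, 0.0, 1.0, 1.0],
                   [0.0, 3.0, 0.0, 0.0],   # aligned with prompt 1
                   [1.0, 1.0, 1.0, 1.0]])

indices, queries = select_queries(tokens, text, k=2)
```

In this toy case the two tokens aligned with the prompts are selected and the background-like tokens are ignored. The extra cost is one small matrix product per image, which fits the lightweight framing in the abstract, though the real design may differ.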

Innovation

Methods, ideas, or system contributions that make the work stand out.

Image-text alignment loss replaces category regression

Lightweight text-guided strategy enhances feature extraction

Real-time detection at 36 FPS on RTX 4090
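The reported numbers can be unpacked with a little arithmetic. The snippet below uses only figures stated in the abstract (37.2 mAP, 79.8 Recall, margins of 12.4 and 42.0 over YOLO-World, 36 FPS); the variable names are illustrative, and the per-frame latency assumes the FPS figure is a simple average.

```python
# Figures quoted in the abstract for zero-shot detection on DIOR.
ova_det_map, ova_det_recall = 37.2, 79.8
map_margin, recall_margin = 12.4, 42.0
fps = 36

# Implied YOLO-World baseline: reported score minus reported margin.
yolo_world_map = round(ova_det_map - map_margin, 1)
yolo_world_recall = round(ova_det_recall - recall_margin, 1)

# Average time budget per frame at the reported throughput.
ms_per_frame = round(1000 / fps, 1)

print(f"Implied YOLO-World: {yolo_world_map} mAP, {yolo_world_recall} Recall")
print(f"Per-frame latency at {fps} FPS: about {ms_per_frame} ms")
```

So the comparison implies a YOLO-World baseline of 24.8 mAP and 37.8 Recall in the same setting, and 36 FPS corresponds to roughly 27.8 ms per image on the RTX 4090.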

🔎 Similar Papers

A Lightweight Modular Framework for Low-Cost Open-Vocabulary Object Detection Training