OVA-Det: Open Vocabulary Aerial Object Detection with Image-Text Collaboration

📅 2024-08-22
🤖 AI Summary
Aerial object detection usually relies on predefined categories, which limits its use in open-world scenarios. To address this, the paper proposes OVA-Det, an efficient open-vocabulary detector for aerial scenes. It replaces the conventional category regression loss with an image-to-text alignment loss, and it adds a lightweight text-guided strategy that enhances feature extraction in the encoder and lets decoder queries focus on class-relevant image features. On three widely used benchmarks, including DIOR, OVA-Det outperforms prior state-of-the-art methods: in zero-shot detection on DIOR it reaches 37.2 mAP and 79.8 Recall while running at 36 FPS on an RTX 4090. By bringing the text modality into the detection architecture, the work improves generalization to unseen categories in open-world settings.

📝 Abstract
Aerial object detection plays a crucial role in numerous applications. However, most existing methods focus on detecting predefined object categories, limiting their applicability in real-world open scenarios. In this paper, we extend aerial object detection to open scenarios through image-text collaboration and propose OVA-Det, a highly efficient open-vocabulary detector for aerial scenes. Specifically, we first introduce an image-to-text alignment loss to replace the conventional category regression loss, thereby eliminating category limitations. Next, we propose a lightweight text-guided strategy that enhances the feature extraction process in the encoder and enables queries to focus on class-relevant image features within the decoder, further improving detection accuracy without introducing significant additional costs. Extensive comparison experiments demonstrate that the proposed OVA-Det outperforms state-of-the-art methods on all three widely used benchmark datasets by a large margin. For instance, for zero-shot detection on DIOR, OVA-Det achieves 37.2 mAP and 79.8 Recall, which are 12.4 and 42.0 points higher than those of YOLO-World, respectively. In addition, the inference speed of OVA-Det reaches 36 FPS on an RTX 4090, meeting the real-time detection requirements of various applications. The code is available at https://github.com/GT-Wei/OVA-Det.
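To make the idea of swapping a fixed-category loss for an image-to-text alignment loss more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the softmax cross-entropy form, the temperature value, and the names `alignment_loss`, `region_emb`, and `text_emb` are all assumptions chosen for illustration. The point it shows is that the class "weights" are text embeddings, so the set of categories is not baked into the model.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale each vector to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def alignment_loss(region_emb, text_emb, labels, temperature=0.07):
    """Illustrative alignment loss (an assumption, not the paper's exact form).

    region_emb: (N, D) embeddings of N predicted regions.
    text_emb:   (C, D) embeddings of C category names.
    labels:     (N,) index of the matching category text for each region.
    """
    # Cosine-similarity logits between every region and every category text.
    logits = l2_normalize(region_emb) @ l2_normalize(text_emb).T / temperature
    # Log-softmax over categories, shifted for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Mean negative log-likelihood of the matching text.
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Toy data: random vectors stand in for real text and region embeddings.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(3, 16))
labels = np.array([0, 2, 1, 0])
aligned = text_emb[labels] + 0.01 * rng.normal(size=(4, 16))  # close to the right text
misaligned = text_emb[(labels + 1) % 3]                       # equal to a wrong text

loss_aligned = alignment_loss(aligned, text_emb, labels)
loss_misaligned = alignment_loss(misaligned, text_emb, labels)
print(loss_aligned, loss_misaligned)
```

In this toy setup the loss is lower when region embeddings sit near the embedding of their own category text, which is the behavior an alignment objective is meant to reward.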
Problem

Research questions and friction points this paper is trying to address.

Extends aerial object detection to open scenarios.
Proposes OVA-Det for efficient open-vocabulary detection.
Improves detection accuracy with lightweight text-guided strategy.
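As a rough illustration of what "open vocabulary" means at inference time, the sketch below labels regions by comparing their embeddings with embeddings of arbitrary category names. Everything here is hypothetical: `classify_open_vocab`, the threshold, and the hand-written 3-dimensional vectors (which stand in for the output of a real text encoder) are assumptions, not part of OVA-Det.

```python
import numpy as np

def classify_open_vocab(region_emb, text_emb, class_names, score_thresh=0.6):
    """Assign each region the category whose text embedding is most similar.

    Regions whose best cosine similarity is below score_thresh get no label.
    """
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = r @ t.T                                  # (num_regions, num_classes)
    best = sim.argmax(axis=1)
    scores = sim[np.arange(len(best)), best]
    return [
        (class_names[i] if s >= score_thresh else None, float(s))
        for i, s in zip(best, scores)
    ]

# Hypothetical vocabulary; swapping this list changes what can be detected
# without retraining, as long as a text embedding exists for each name.
class_names = ["airplane", "ship", "windmill"]
text_emb = np.eye(3)                               # toy stand-in for text features
region_emb = np.array([
    [0.9, 0.1, 0.0],    # close to "airplane"
    [0.0, 0.2, 0.95],   # close to "windmill"
    [0.5, 0.5, 0.5],    # ambiguous, falls below the threshold
])

preds = classify_open_vocab(region_emb, text_emb, class_names)
print(preds)
```

The category list is an input rather than a property of the model, which is what removes the dependence on a predefined label set.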
Innovation

Methods, ideas, or system contributions that make the work stand out.

Image-to-text alignment loss replaces the category regression loss
Lightweight text-guided strategy enhances feature extraction
Real-time detection at 36 FPS on RTX 4090
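The items above can be illustrated with a small sketch. One plausible way to let queries focus on class-relevant image features is to rank image tokens by their similarity to the category text embeddings and seed the decoder queries from the top-ranked tokens. This is an assumption for illustration only, not a description of the mechanism used in OVA-Det, and the names `select_text_guided_queries` and `ms_per_frame` are hypothetical. The second helper simply converts the reported 36 FPS into a per-image time budget.

```python
import numpy as np

def select_text_guided_queries(image_tokens, text_emb, num_queries):
    """Pick the image tokens most similar to any category text (illustrative).

    image_tokens: (T, D) encoder output tokens.
    text_emb:     (C, D) category text embeddings.
    Returns the indices of the selected tokens and the tokens themselves.
    """
    img = image_tokens / np.linalg.norm(image_tokens, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    relevance = (img @ txt.T).max(axis=1)          # best text match per token
    top = np.argsort(-relevance)[:num_queries]     # highest relevance first
    return top, image_tokens[top]

def ms_per_frame(fps):
    # Time available per image, in milliseconds, at a given frame rate.
    return 1000.0 / fps

# Toy 2-D tokens and a single toy text embedding.
image_tokens = np.array([
    [0.0, 1.0],
    [1.0, 0.1],
    [0.2, 1.0],
    [0.9, 0.3],
    [-1.0, 0.0],
])
text_emb = np.array([[1.0, 0.0]])

top, selected = select_text_guided_queries(image_tokens, text_emb, num_queries=2)
print(top, ms_per_frame(36))
```

At 36 FPS the whole pipeline has about 27.8 ms per image, which is why any text-guided component needs to stay lightweight.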
👥 Authors
Guoting Wei
Nanjing University of Science and Technology, Intellifusion
Xia Yuan
Nanjing University of Science and Technology
Yu Liu
Zhejiang Lab
Zhenhao Shang
Northwestern Polytechnical University
Kelu Yao
Zhejiang Lab
Chao Li
Zhejiang Lab
Qingsen Yan
Northwestern Polytechnical University
Image processing, Image fusion, Continual learning
Chunxia Zhao
Nanjing University of Science and Technology
Haokui Zhang
Northwestern Polytechnical University
Approximate nearest neighbor search, neural architecture search, depth estimation, HSI classification
Rong Xiao
Intellifusion