Fine-Grained Zero-Shot Object Detection

📅 2025-07-14
🤖 AI Summary
This paper introduces fine-grained zero-shot object detection (FG-ZSD), a new task that aims to localize and recognize visually similar fine-grained classes (e.g., bird species) in the zero-shot setting, where some classes are unseen during training. The authors build FGZSD-Birds, the first benchmark dataset for FG-ZSD, and propose MSHC, a method based on an improved two-stage detector that uses a multi-level semantics-aware embedding alignment loss to couple the visual and semantic spaces tightly. Experiments on FGZSD-Birds show that MSHC outperforms existing zero-shot detection models.

📝 Abstract
Zero-shot object detection (ZSD) aims to leverage semantic descriptions to localize and recognize objects of both seen and unseen classes. Existing ZSD work mainly addresses coarse-grained object detection, where the classes are visually quite different and thus relatively easy to distinguish. In real life, however, we often face fine-grained object detection scenarios, where the classes are too similar to be easily distinguished, for example when detecting different kinds of birds, fishes, and flowers. In this paper, we propose and solve a new problem called Fine-Grained Zero-Shot Object Detection (FG-ZSD), which aims to detect objects of different classes with minute differences in details under the ZSD paradigm. We develop an effective method called MSHC for the FG-ZSD task, which is based on an improved two-stage detector and employs a multi-level semantics-aware embedding alignment loss to ensure tight coupling between the visual and semantic spaces. Because existing ZSD datasets are not suitable for the new FG-ZSD task, we build the first FG-ZSD benchmark dataset, FGZSD-Birds, which contains 148,820 images spanning 36 orders, 140 families, 579 genera, and 1,432 species. Extensive experiments on FGZSD-Birds show that our method outperforms existing ZSD models.
Problem

Research questions and friction points this paper is trying to address.

Detect fine-grained objects with minute visual differences
Leverage semantic descriptions for unseen and seen classes
Address lack of suitable datasets for fine-grained zero-shot detection
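
To make the second point concrete, here is a minimal sketch of how a zero-shot detector can label a region by comparing it with class semantic embeddings. It is not the paper's implementation: the function names, the toy 3-D embeddings, and the assumption that the region feature has already been projected into the semantic space are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def classify_region(region_feature, semantic_embeddings):
    """Return the class whose semantic embedding is closest to the region feature.

    region_feature: vector assumed to live in the semantic space already.
    semantic_embeddings: dict mapping class name -> embedding, covering
        both seen and unseen classes.
    """
    scores = {name: cosine(region_feature, emb)
              for name, emb in semantic_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical toy embeddings: two seen classes and one unseen class.
semantic_embeddings = {
    "seen_species_a": [0.9, 0.1, 0.0],
    "seen_species_b": [0.1, 0.9, 0.0],
    "unseen_species_c": [0.1, 0.8, 0.3],
}

label, score = classify_region([0.1, 0.75, 0.35], semantic_embeddings)
print(label, round(score, 3))
```

Note that `seen_species_b` and `unseen_species_c` have nearly parallel embeddings in this toy setup, which mirrors the fine-grained difficulty: a small error in the region feature is enough to flip the prediction.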
Innovation

Methods, ideas, or system contributions that make the work stand out.

Improved two-stage detector for FG-ZSD
Multi-level semantics-aware embedding alignment loss
First FG-ZSD benchmark dataset FGZSD-Birds
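
The summary does not give the form of the multi-level alignment loss, so the following is only an illustrative sketch of one way such a loss could look: a cross-entropy over cosine similarities computed at each taxonomy level (for example genus and species) and then summed. The function names, the temperature, the per-level weights, and the toy embeddings are assumptions, not details from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def alignment_loss(visual, class_embeddings, target, temperature=0.1):
    """Cross-entropy over temperature-scaled cosine similarities at one level."""
    logits = [cosine(visual, emb) / temperature for emb in class_embeddings]
    max_logit = max(logits)  # subtract the max for a stable log-sum-exp
    log_z = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_z - logits[target]


def multi_level_loss(visual, levels, weights=None):
    """Weighted sum of per-level alignment losses.

    levels: list of (class_embeddings, target_index) pairs, one per
        taxonomy level (hypothetical layout).
    weights: optional per-level weights; defaults to 1.0 for every level.
    """
    if weights is None:
        weights = [1.0] * len(levels)
    return sum(w * alignment_loss(visual, embs, tgt)
               for w, (embs, tgt) in zip(weights, levels))


# Hypothetical toy example with 2-D embeddings at two levels.
visual = [1.0, 0.1]
genus_level = ([[1.0, 0.0], [0.0, 1.0]], 0)
species_level = ([[0.9, 0.2], [0.8, 0.4], [0.0, 1.0]], 0)
loss = multi_level_loss(visual, [genus_level, species_level])
print(loss)
```

In this sketch the coarse level is easy to satisfy while the species level, whose embeddings lie close together, contributes most of the loss, which is the intuition behind supervising several granularities at once.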
Authors

Hongxu Ma, Google
Chenbo Zhang, Fudan University, Shanghai, China
Lu Zhang, Fudan University, Shanghai, China
Jiaogen Zhou, School of Geography and Planning, Huaiyin Normal University, Huaian, China
Jihong Guan, Professor of Computer Science, Tongji University (Data Mining and Management, Machine Learning, Bioinformatics)
Shuigeng Zhou, Fudan University (Database, Bioinformatics, Machine Learning)