VL-DINO: Leveraging CLIP Vision-Language Knowledge for Open-Vocabulary Object Detection

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited integration of vision-language knowledge in open-vocabulary object detection by proposing VL-DINO, a framework that builds CLIP's textual and visual knowledge into the DINO detector. VL-DINO introduces three components: Query-guided Positive Sample Construction (QPSC), which constructs additional high-quality positive samples so that DINO can be trained on mixed, heterogeneous data sources; a Visual Semantic Encoder (VSE), which distills CLIP visual knowledge into backbone features; and Object-Region Semantic Alignment (ORSA), which aligns object-centric region features with their textual embeddings. Evaluated on LVIS in the zero-shot setting, VL-DINO-T and VL-DINO-L achieve 36.3 and 38.1 AP, respectively, outperforming the prior approaches the authors compare against.

📝 Abstract

Vision-language models such as CLIP can provide rich semantic priors for open-vocabulary object detection. However, jointly integrating both textual and visual knowledge into detection architectures remains challenging. In this paper, we propose VL-DINO, an open-vocabulary detector that enhances DINO through more effective exploitation of CLIP's vision-language knowledge. First, a Query-guided Positive Sample Construction (QPSC) module constructs additional high-quality positive samples. This enables the vanilla DINO framework to better accommodate mixed training across heterogeneous data sources and provides more vision-language alignment signals, thereby incorporating richer textual knowledge during training. Second, a Visual Semantic Encoder (VSE) module distills CLIP visual knowledge into backbone-extracted features, producing fused features for subsequent encoder refinement. Third, based on the fused features, an Object-Region Semantic Alignment (ORSA) module extracts object-centric region features and aligns them with the corresponding textual embeddings, further incorporating textual cues. In the zero-shot setting, VL-DINO-T and VL-DINO-L achieve 36.3 and 38.1 AP on the LVIS benchmark, respectively, consistently outperforming prior approaches. Extensive experiments demonstrate the effectiveness and competitive performance of the proposed design.
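
The abstract does not specify how ORSA aligns region features with textual embeddings. As a rough illustration of the general idea only, the sketch below scores each region feature against a set of class text embeddings by cosine similarity and applies a softmax cross-entropy loss toward the matching class. Everything here is an assumption for illustration, not the paper's formulation: the function names, the temperature of 0.07, the toy 3-dimensional vectors, and the choice of cross-entropy as the alignment objective.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_logits(region_feat, text_embeds, temperature=0.07):
    """Cosine similarity between one region feature and every text embedding,
    divided by a temperature (hypothetical value, not taken from the paper)."""
    r = l2_normalize(region_feat)
    logits = []
    for t in text_embeds:
        t_n = l2_normalize(t)
        logits.append(sum(a * b for a, b in zip(r, t_n)) / temperature)
    return logits


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_loss(region_feats, text_embeds, labels, temperature=0.07):
    """Mean cross-entropy between region-to-text similarities and class labels.
    This is an assumed stand-in objective for region-text alignment."""
    losses = []
    for feat, label in zip(region_feats, labels):
        probs = softmax(cosine_logits(feat, text_embeds, temperature))
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)


# Toy data: two class text embeddings and two region features (made-up values).
text_embeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
region_feats = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
labels = [0, 1]

loss = alignment_loss(region_feats, text_embeds, labels)
print(f"alignment loss: {loss:.6f}")
```

In a real detector the region features would come from the fused feature maps and the text embeddings from CLIP's text encoder; the toy vectors above only show the shape of the computation. The loss is small when each region feature points in the same direction as its class embedding and grows when the labels are mismatched.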

Problem

Research questions and friction points this paper is trying to address.

open-vocabulary object detection

vision-language models

CLIP

semantic alignment

heterogeneous data

Innovation

Methods, ideas, or system contributions that make the work stand out.

open-vocabulary object detection

vision-language alignment

CLIP knowledge distillation