LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models

📅 2025-01-31
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses open-vocabulary object detection, aiming to improve the detection of categories unseen during training. The authors propose LLMDet, a framework that co-trains an open-vocabulary detector with a large language model (LLM) that generates an image-level detailed caption for each image and region-level short captions for regions of interest, providing dual-granularity supervision. To support this, they collect GroundingCap-1M, a large-scale dataset in which each image is paired with grounding labels and an image-level detailed caption, and finetune the detector with a standard grounding loss plus a caption-generation loss. Under LLM supervision, LLMDet outperforms its baseline by a clear margin on open-vocabulary benchmarks such as LVIS, and the improved detector can in turn build a stronger large multi-modal model, yielding mutual benefits. The code, models, and the GroundingCap-1M dataset are publicly released.
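As a rough, hypothetical sketch of the dual-granularity supervision described above (not the official LLMDet implementation), the snippet below combines a standard grounding loss with caption-generation losses at both the region and image level. All names, tensor shapes, and the weighting factor `lambda_cap` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of LLMDet-style co-training: the detector is supervised
# by a grounding loss over region-text alignment scores, while an LLM head is
# supervised to generate region-level short captions and an image-level long
# caption. Shapes and names are illustrative, not from the released code.

def grounding_loss(region_logits, region_targets):
    # Standard grounding/classification loss over region-phrase alignment scores.
    return F.binary_cross_entropy_with_logits(region_logits, region_targets)

def caption_loss(lm_logits, caption_token_ids):
    # Next-token prediction loss for LLM-generated captions (region or image level).
    return F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)),
        caption_token_ids.reshape(-1),
        ignore_index=-100,  # masked/padded positions
    )

# Toy batch: 8 regions, 20 phrase classes; captions over a 1000-token vocabulary.
region_logits = torch.randn(8, 20, requires_grad=True)
region_targets = torch.randint(0, 2, (8, 20)).float()
region_lm_logits = torch.randn(8, 12, 1000, requires_grad=True)   # short captions
region_tokens = torch.randint(0, 1000, (8, 12))
image_lm_logits = torch.randn(1, 64, 1000, requires_grad=True)    # long caption
image_tokens = torch.randint(0, 1000, (1, 64))

lambda_cap = 1.0  # assumed weight balancing grounding vs. caption objectives
loss = (
    grounding_loss(region_logits, region_targets)
    + lambda_cap * (caption_loss(region_lm_logits, region_tokens)
                    + caption_loss(image_lm_logits, image_tokens))
)
loss.backward()
```

In the paper's setup, the caption losses would be computed from features produced by the detector, which is what couples the detector and the LLM during co-training.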

๐Ÿ“ Abstract
Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that co-training an open-vocabulary detector with a large language model that generates an image-level detailed caption for each image can further improve performance. To achieve this goal, we first collect a dataset, GroundingCap-1M, wherein each image is accompanied by associated grounding labels and an image-level detailed caption. With this dataset, we finetune an open-vocabulary detector with training objectives including a standard grounding loss and a caption generation loss. We take advantage of a large language model to generate both region-level short captions for each region of interest and image-level long captions for the whole image. Under the supervision of the large language model, the resulting detector, LLMDet, outperforms the baseline by a clear margin, enjoying superior open-vocabulary ability. Further, we show that the improved LLMDet can in turn build a stronger large multi-modal model, achieving mutual benefits. The code, model, and dataset are available at https://github.com/iSEE-Laboratory/LLMDet.
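To make the dataset description concrete, here is a hypothetical record layout for a GroundingCap-1M-style sample, inferred only from the abstract (grounding labels plus a detailed image-level caption, with region-level short captions used during training). The field names are assumptions for illustration, not the released schema.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed record layout for one GroundingCap-1M-style sample; the real
# dataset format may differ.

@dataclass
class RegionAnnotation:
    box_xyxy: List[float]        # [x1, y1, x2, y2] in pixels (grounding label)
    phrase: str                  # grounded phrase for this box
    short_caption: str           # LLM-generated region-level short caption

@dataclass
class GroundingCapSample:
    image_path: str
    regions: List[RegionAnnotation] = field(default_factory=list)
    detailed_caption: str = ""   # LLM-generated image-level long caption

sample = GroundingCapSample(
    image_path="images/000001.jpg",
    regions=[
        RegionAnnotation([34.0, 50.0, 210.0, 300.0], "a brown dog",
                         "a brown dog lying on the grass"),
    ],
    detailed_caption="A brown dog lies on a sunlit lawn in front of a wooden fence.",
)
```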
Problem

Research questions and friction points this paper is trying to address.

Open-Vocabulary Object Detection
Object Localization
Image Captioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMDet
Language Model Integration
GroundingCap-1M Dataset
👥 Authors

Shenghao Fu
Sun Yat-sen University
Computer Vision, Object Detection, Large Multi-modal Models
Qize Yang
Tongyi Lab, Alibaba Group
Computer Vision, Deep Learning
Qijie Mo
School of Computer Science and Engineering, Sun Yat-sen University, China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China
Junkai Yan
Insta360
Self-supervised Learning, Object Detection, Multimodal Learning
Xihan Wei
Tongyi Lab, Alibaba Group
Jingke Meng
Sun Yat-sen University
Computer Vision
Xiaohua Xie
School of Computer Science and Engineering, Sun Yat-sen University, China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China; Guangdong Province Key Laboratory of Information Security Technology, China
Wei-Shi Zheng
Professor @ Sun Yat-sen University
Computer Vision, Pattern Recognition, Machine Learning