🤖 AI Summary
This work addresses open-vocabulary image recognition, aiming to enhance model capability in detecting and describing unseen categories. We propose LLMDet, a framework that leverages large language models (LLMs) to generate both image-level detailed captions and region-level short descriptions, thereby establishing dual-granularity supervision signals. Based on this, we construct GroundingCap-1M, a large-scale multimodal dataset, and design a joint grounding loss and vision-language generation loss to enable bidirectional co-optimization between the detector and the multimodal model. Our method requires only image-level captions in addition to standard grounding annotations, significantly improving open-vocabulary generalization. On benchmarks including LVIS, it achieves substantial mAP gains over state-of-the-art methods while also boosting the performance of the underlying multimodal models. All code, models, and the GroundingCap-1M dataset are publicly released.
📝 Abstract
Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that co-training an open-vocabulary detector with a large language model that generates an image-level detailed caption for each image can further improve performance. To this end, we first collect a dataset, GroundingCap-1M, in which each image is paired with its grounding labels and an image-level detailed caption. With this dataset, we fine-tune an open-vocabulary detector with training objectives that include a standard grounding loss and a caption-generation loss. We use a large language model to generate both region-level short captions for each region of interest and image-level long captions for the whole image. Under the supervision of the large language model, the resulting detector, LLMDet, outperforms the baseline by a clear margin and enjoys superior open-vocabulary ability. Furthermore, we show that the improved LLMDet can in turn build a stronger large multi-modal model, achieving mutual benefits. The code, model, and dataset are available at https://github.com/iSEE-Laboratory/LLMDet.
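The abstract describes a joint objective: a standard grounding loss on region-text alignment plus a caption-generation loss supervised by the LLM. The following is a minimal pure-Python sketch of how such a combined objective could be composed; the function names, the binary-cross-entropy stand-in for the grounding term, and the `caption_weight` balancing factor are illustrative assumptions, not the paper's actual implementation.

```python
import math

def caption_loss(token_probs):
    # Caption-generation term: average negative log-likelihood of the
    # ground-truth caption tokens under the LLM head (toy stand-in).
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def grounding_loss(region_scores, labels):
    # Grounding term: binary cross-entropy between region-text alignment
    # scores and 0/1 match labels (toy stand-in for the standard loss).
    eps = 1e-9
    total = 0.0
    for s, y in zip(region_scores, labels):
        total += -(y * math.log(s + eps) + (1 - y) * math.log(1 - s + eps))
    return total / len(region_scores)

def total_loss(region_scores, labels, token_probs, caption_weight=1.0):
    # Joint objective: grounding loss plus a weighted caption-generation
    # loss, co-optimizing the detector and the language-model head.
    return (grounding_loss(region_scores, labels)
            + caption_weight * caption_loss(token_probs))
```

With `caption_weight=0.0` this reduces to the plain grounding baseline, so the weight controls how strongly the LLM's caption supervision shapes the detector.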