Object-Centric Open-Vocabulary Image-Retrieval with Aggregated Features

📅 2023-09-26

🏛️ British Machine Vision Conference

📈 Citations: 0

✨ Influential: 0

career value

181K/year

🤖 AI Summary

This work addresses the challenging problem of object-centric small-object image retrieval under open-vocabulary settings—i.e., precisely localizing specific small-scale objects in images using free-text queries. To this end, we propose a novel CLIP dense feature aggregation framework that generates compact, scalable, and localization-aware image representations via spatially aware local feature encoding and global semantic alignment. Our method integrates CLIP-based dense feature extraction, spatial-aggregated encoding, and contrastive learning-driven open-vocabulary alignment. Evaluated on three standard benchmarks, it achieves up to a 15.0-point improvement in mean Average Precision (mAP) over conventional global embedding approaches, while supporting efficient large-scale deployment and offering strong visual interpretability. The core contribution lies in the first effective and efficient aggregation of CLIP dense features—enabling fine-grained object localization without compromising retrieval efficiency.

📝 Abstract

The task of open-vocabulary object-centric image retrieval involves the retrieval of images containing a specified object of interest, delineated by an open-set text query. As working on large image datasets becomes standard, solving this task efficiently has gained significant practical importance. Applications include targeted performance analysis of retrieved images using ad-hoc queries and hard example mining during training. Recent advancements in contrastive-based open vocabulary systems have yielded remarkable breakthroughs, facilitating large-scale open vocabulary image retrieval. However, these approaches use a single global embedding per image, thereby constraining the system's ability to retrieve images containing relatively small object instances. Alternatively, incorporating local embeddings from detection pipelines faces scalability challenges, making it unsuitable for retrieval from large databases. In this work, we present a simple yet effective approach to object-centric open-vocabulary image retrieval. Our approach aggregates dense embeddings extracted from CLIP into a compact representation, essentially combining the scalability of image retrieval pipelines with the object identification capabilities of dense detection methods. We show the effectiveness of our scheme to the task by achieving significantly better results than global feature approaches on three datasets, increasing accuracy by up to 15 mAP points. We further integrate our scheme into a large scale retrieval framework and demonstrate our method's advantages in terms of scalability and interpretability.

Problem

Research questions and friction points this paper is trying to address.

Large-scale Image Retrieval

Specific Small Object Detection

Open Descriptive Text

Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-Centric Retrieval

Dense Information Integration

CLIP Utilization

🔎 Similar Papers

Unsupervised Collaborative Metric Learning with Mixed-Scale Groups for General Object Retrieval