SORCE: Small Object Retrieval in Complex Environments

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-image retrieval (T2IR) methods struggle to localize small, non-salient objects within complex scenes. Method: We introduce a new subtask, Small Object Retrieval in Complex Environments (SORCE), and propose a multi-embedding representation framework driven by regional prompts. Departing from conventional single-vector image encoding, our approach uses Regional Prompts (ReP) to instruct multimodal large language models (MLLMs) to extract multiple fine-grained, text-customized visual features per image, enabling finer cross-modal alignment. Contribution/Results: To support SORCE, we construct SORCE-1K, the first dedicated benchmark for this task. Extensive experiments show that our method significantly outperforms mainstream T2IR models on SORCE-1K, validating the effectiveness of multi-embedding representations and prompt-driven MLLM features. This work establishes both a new benchmark and a principled technical pathway for small-object-oriented cross-modal retrieval.

📝 Abstract
Text-to-Image Retrieval (T2IR) is a highly valuable task that aims to match a given textual query to images in a gallery. Existing benchmarks primarily focus on textual queries describing overall image semantics or foreground salient objects, possibly overlooking inconspicuous small objects, especially in complex environments. Such small object retrieval is crucial, as in real-world applications, the targets of interest are not always prominent in the image. Thus, we introduce SORCE (Small Object Retrieval in Complex Environments), a new subfield of T2IR, focusing on retrieving small objects in complex images with textual queries. We propose a new benchmark, SORCE-1K, consisting of images with complex environments and textual queries describing less conspicuous small objects with minimal contextual cues from other salient objects. Preliminary analysis on SORCE-1K finds that existing T2IR methods struggle to capture small objects and encode all the semantics into a single embedding, leading to poor retrieval performance on SORCE-1K. Therefore, we propose to represent each image with multiple distinctive embeddings. We leverage Multimodal Large Language Models (MLLMs) to extract multiple embeddings for each image instructed by a set of Regional Prompts (ReP). Experimental results show that our multi-embedding approach through MLLM and ReP significantly outperforms existing T2IR methods on SORCE-1K. Our experiments validate the effectiveness of SORCE-1K for benchmarking SORCE performance, highlighting the potential of multi-embedding representation and text-customized MLLM features for addressing this task.
Problem

Research questions and friction points this paper is trying to address.

How to retrieve small, inconspicuous objects in complex images from textual queries
Existing T2IR methods compress whole-image semantics into a single embedding, losing small objects
How to design image representations that preserve small-object semantics for retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multiple distinctive embeddings per image instead of a single global vector
MLLMs as extractors of text-customized visual features
Regional Prompts (ReP) instructing the MLLM which image regions to encode
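The paper does not include pseudocode here, but the multi-embedding idea can be sketched as follows: each gallery image carries several region embeddings, and a query matches an image via its best-matching region rather than one global vector. This is an illustrative assumption-based sketch in NumPy, not the authors' implementation; the function names and toy vectors are invented for the example.

```python
# Illustrative sketch (not the paper's code): multi-embedding retrieval,
# scoring each image by the max cosine similarity over its region embeddings.
import numpy as np


def normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize embeddings along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def score_image(text_emb: np.ndarray, region_embs: np.ndarray) -> float:
    """Score one image as the best match among its region embeddings."""
    sims = normalize(region_embs) @ normalize(text_emb)
    return float(sims.max())


def retrieve(text_emb: np.ndarray, gallery: list) -> int:
    """Return the index of the highest-scoring gallery image."""
    return int(np.argmax([score_image(text_emb, r) for r in gallery]))


# Toy example: only the second image has a region aligned with the query,
# mimicking a small object that a single global embedding would wash out.
query = np.array([1.0, 0.0, 0.0])
img0 = np.array([[0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])   # no region matches the query
img1 = np.array([[0.0, 1.0, 0.0],
                 [0.9, 0.1, 0.0]])   # one small-object region matches
best = retrieve(query, [img0, img1])  # -> 1
```

A single-embedding baseline would instead average or pool these regions, diluting the matching region's signal, which is the failure mode SORCE-1K is designed to expose.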
Authors

Chunxu Liu — Nanjing University (Video Frame Interpolation; Vision-Language Models)
Chi Xie — SenseTime Research; Tongji University
Xiaxu Chen — SenseTime Research; Beijing Institute of Technology
Wei Li — SenseTime Research
Feng Zhu — SenseTime Research
Rui Zhao — SenseTime Research
Limin Wang — State Key Laboratory for Novel Software Technology, Nanjing University; Shanghai AI Lab