AI Summary
This work addresses a performance bottleneck in text-to-image retrieval with vision-language models (VLMs) caused by query ambiguity. We propose a novel inference-time relevance feedback method that requires no fine-tuning or model scaling. To our knowledge, this is the first application of relevance feedback to VLM-based retrieval. Our approach introduces two complementary mechanisms: (i) generative feedback, which leverages an LLM to produce semantically refined queries, and (ii) attention-based feedback aggregation, which employs a Transformer to dynamically weight features across multiple feedback rounds, jointly mitigating query drift. The method is compatible with standard VLMs (e.g., CLIP) and supports pseudo-relevance, explicit, and generative feedback. Evaluated on Flickr30k and COCO, it improves MRR@5 by 3–5% for small models and 1–3% for large models. The attentive feedback summarizer (AFS) module approaches the upper bound set by explicit feedback, significantly enhancing robustness and iterative retrieval capability.
Abstract
Large vision-language models (VLMs) enable intuitive visual search using natural language queries. However, improving their performance often requires fine-tuning and scaling to larger model variants. In this work, we propose a mechanism inspired by traditional text-based search to improve retrieval performance at inference time: relevance feedback. While relevance feedback can serve as an alternative to fine-tuning, its model-agnostic design also enables use with fine-tuned VLMs. Specifically, we introduce and evaluate four feedback strategies for VLM-based retrieval. First, we revise classical pseudo-relevance feedback (PRF), which refines query embeddings based on top-ranked results. To address its limitations, we propose generative relevance feedback (GRF), which uses synthetic captions for query refinement. Furthermore, we introduce an attentive feedback summarizer (AFS), a custom transformer-based model that integrates multimodal fine-grained features from relevant items. Finally, we simulate explicit feedback using ground-truth captions as an upper-bound baseline. Experiments on Flickr30k and COCO with several VLM backbones show that GRF, AFS, and explicit feedback improve retrieval performance by 3–5% in MRR@5 for smaller VLMs, and 1–3% for larger ones, compared to retrieval with no feedback. Moreover, AFS, similarly to explicit feedback, mitigates query drift and is more robust than GRF in iterative, multi-turn retrieval settings. Our findings demonstrate that relevance feedback can consistently enhance retrieval across VLMs and open up opportunities for interactive and adaptive visual search.
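To make the classical PRF step concrete, the sketch below shows a Rocchio-style update on embedding vectors: retrieve the top-k images by cosine similarity, then interpolate the query embedding toward their centroid. This is a minimal illustration only; the paper's exact update rule, the interpolation weight `alpha`, and the function names here are assumptions, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit L2 norm (cosine similarity becomes a dot product)."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def prf_refine(query_emb, image_embs, k=3, alpha=0.6):
    """Rocchio-style pseudo-relevance feedback (illustrative sketch).

    Treat the top-k retrieved images as pseudo-relevant and move the
    query embedding toward their centroid; `alpha` (a hypothetical
    hyperparameter) controls how much of the original query is kept.
    """
    q = l2_normalize(query_emb)
    imgs = l2_normalize(image_embs)
    sims = imgs @ q                   # cosine similarity of each image to the query
    topk = np.argsort(-sims)[:k]     # indices of the k highest-scoring images
    centroid = imgs[topk].mean(axis=0)
    refined = alpha * q + (1 - alpha) * centroid
    return l2_normalize(refined)

# Toy example with random stand-ins for CLIP-style embeddings.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(5, 8))   # 5 candidate images, 8-dim embeddings
query = rng.normal(size=8)
refined = prf_refine(query, image_embs, k=2)
print(refined.shape)
```

In a real pipeline the embeddings would come from the VLM's image and text encoders; the refined vector simply replaces the original query embedding for a second retrieval round. GRF and AFS as described above replace this fixed averaging with LLM-generated query text and learned attention weights, respectively.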