A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback

πŸ“… 2025-11-21
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses a performance bottleneck in text-to-image retrieval with vision-language models (VLMs): query ambiguity. It proposes an inference-time relevance feedback method that requires neither fine-tuning nor scaling to larger models; to the authors' knowledge, this is the first application of relevance feedback to VLM-based retrieval. The approach introduces two complementary mechanisms: (i) generative relevance feedback, which leverages an LLM to produce semantically refined queries, and (ii) attention-based feedback aggregation, which uses a transformer to weight feedback items across multiple rounds; together these mitigate query drift. The method is compatible with standard VLMs (e.g., CLIP) and supports pseudo-relevance, explicit, and generative feedback. Evaluated on Flickr30k and COCO, it improves MRR@5 by 3–5% for smaller models and 1–3% for larger ones. The attentive feedback summarizer (AFS) approaches the upper bound set by explicit feedback, significantly improving robustness in iterative retrieval.
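The generative feedback step described above could be sketched, under simplifying assumptions, as blending the original query embedding with embeddings of synthetic captions. The function name and the `beta` mixing weight are hypothetical; caption generation itself (an LLM prompted with top-ranked results) is assumed to happen upstream:

```python
import numpy as np

def generative_relevance_feedback(query_emb, caption_embs, beta=0.5):
    """Hypothetical GRF step: blend the query embedding with the mean
    embedding of LLM-generated (synthetic) captions describing the
    pseudo-relevant images. All embeddings assumed L2-normalized."""
    synthetic = caption_embs.mean(axis=0)          # centroid of caption embeddings
    refined = beta * query_emb + (1 - beta) * synthetic
    return refined / np.linalg.norm(refined)       # re-normalize for cosine search

# toy example with random unit vectors standing in for CLIP embeddings
rng = np.random.default_rng(2)
caps = rng.normal(size=(3, 8))
caps /= np.linalg.norm(caps, axis=1, keepdims=True)
q = rng.normal(size=8)
q /= np.linalg.norm(q)
q_grf = generative_relevance_feedback(q, caps)
```

In a real pipeline the caption embeddings would come from the VLM's text encoder, so the blended query stays in the shared image-text embedding space.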

πŸ“ Abstract
Large vision-language models (VLMs) enable intuitive visual search using natural language queries. However, improving their performance often requires fine-tuning and scaling to larger model variants. In this work, we propose a mechanism inspired by traditional text-based search to improve retrieval performance at inference time: relevance feedback. While relevance feedback can serve as an alternative to fine-tuning, its model-agnostic design also enables use with fine-tuned VLMs. Specifically, we introduce and evaluate four feedback strategies for VLM-based retrieval. First, we revise classical pseudo-relevance feedback (PRF), which refines query embeddings based on top-ranked results. To address its limitations, we propose generative relevance feedback (GRF), which uses synthetic captions for query refinement. Furthermore, we introduce an attentive feedback summarizer (AFS), a custom transformer-based model that integrates multimodal fine-grained features from relevant items. Finally, we simulate explicit feedback using ground-truth captions as an upper-bound baseline. Experiments on Flickr30k and COCO with VLM backbones show that GRF, AFS, and explicit feedback improve retrieval performance by 3–5% in MRR@5 for smaller VLMs, and 1–3% for larger ones, compared to retrieval with no feedback. Moreover, AFS, like explicit feedback, mitigates query drift and is more robust than GRF in iterative, multi-turn retrieval settings. Our findings demonstrate that relevance feedback can consistently enhance retrieval across VLMs and open up opportunities for interactive and adaptive visual search.
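The classical PRF baseline the abstract revises can be sketched as a Rocchio-style update: retrieve top-k images by cosine similarity and move the query embedding toward their centroid. This is a minimal sketch, not the paper's implementation; the `alpha` interpolation weight and function name are assumptions:

```python
import numpy as np

def pseudo_relevance_feedback(query_emb, image_embs, k=5, alpha=0.7):
    """Rocchio-style PRF: shift the query embedding toward the centroid
    of the top-k retrieved image embeddings. Embeddings are assumed
    L2-normalized, as with CLIP, so dot product equals cosine similarity."""
    sims = image_embs @ query_emb               # cosine similarity to every image
    top_k = np.argsort(-sims)[:k]               # indices of the k best matches
    centroid = image_embs[top_k].mean(axis=0)   # pseudo-relevant centroid
    refined = alpha * query_emb + (1 - alpha) * centroid
    return refined / np.linalg.norm(refined)    # re-normalize

# toy example: 10 random unit vectors standing in for image embeddings
rng = np.random.default_rng(0)
embs = rng.normal(size=(10, 8))
embs /= np.linalg.norm(embs, axis=1, keepdims=True)
q = embs[3] + 0.1 * rng.normal(size=8)          # query near image 3
q /= np.linalg.norm(q)
q_refined = pseudo_relevance_feedback(q, embs)
```

Because the centroid is computed over unjudged top-ranked results, a bad initial ranking pulls the query further off target, which is the query-drift limitation GRF and AFS are designed to address.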
Problem

Research questions and friction points this paper is trying to address.

Improving text-to-image retrieval performance without fine-tuning VLMs
Addressing limitations of pseudo-relevance feedback in visual search
Enhancing multimodal retrieval robustness in iterative search settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative relevance feedback refines queries with synthetic captions
Attentive feedback summarizer integrates multimodal fine-grained features
Relevance feedback mechanisms enhance retrieval without requiring fine-tuning
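The attentive aggregation idea in the list above can be illustrated with a single-head attention pooling over feedback items; the paper's AFS is a trained transformer over fine-grained multimodal features, so this temperature-scaled dot-product sketch (with assumed names `attentive_feedback` and `tau`) is only a stand-in:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_feedback(query_emb, feedback_embs, tau=0.1):
    """Hypothetical attention pooling: the query attends to feedback
    embeddings, and the refined query blends in the attention-weighted
    summary. Items more similar to the query get larger weights, which
    dampens the influence of off-topic feedback (query drift)."""
    scores = feedback_embs @ query_emb / tau    # scaled similarity scores
    weights = softmax(scores)                   # attention distribution
    pooled = weights @ feedback_embs            # weighted feedback summary
    refined = query_emb + pooled
    return refined / np.linalg.norm(refined)

# toy example: 4 feedback embeddings, query aligned with the first
rng = np.random.default_rng(1)
fb = rng.normal(size=(4, 8))
fb /= np.linalg.norm(fb, axis=1, keepdims=True)
q_new = attentive_feedback(fb[0], fb)
```

Replacing the fixed dot-product scores with learned query/key/value projections, stacked across feedback rounds, recovers the transformer-based design the paper describes.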
πŸ”Ž Similar Papers
No similar papers found.