Zero-Shot Interactive Text-to-Image Retrieval via Diffusion-Augmented Representations

📅 2025-01-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing interactive text-to-image retrieval methods rely on fine-tuning multimodal large language models, which incurs high computational cost and generalizes poorly, especially under dynamic shifts in query and image distributions. Method: We propose a zero-shot interactive retrieval framework that avoids fine-tuning large models entirely. It jointly models cross-modal representations and enhances contextual awareness via LLM-driven query refinement and diffusion model (DM)-guided visual synthesis. Contribution/Results: Our work introduces the first "fine-tuning-free" paradigm for this task, sidestepping the knowledge narrowing that fine-tuning imposes on pretrained models and enabling robust multi-round interaction and inference under distribution shift. On four standard benchmarks, our zero-shot approach matches state-of-the-art fine-tuned methods in retrieval accuracy; under multi-turn interaction, it achieves a 7.61% absolute gain in Hits@10 over fine-tuned baselines, demonstrating significantly improved generalization and adaptability.
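
For readers unfamiliar with the reported metric, here is a minimal sketch of how Hits@10 (top-10 retrieval accuracy) is typically computed from a query-to-image similarity matrix; the function and array names are illustrative, not taken from the paper's code.

```python
import numpy as np

def hits_at_k(similarity: np.ndarray, gt_indices: np.ndarray, k: int = 10) -> float:
    """Fraction of queries whose ground-truth image appears among the top-k results.

    similarity: (num_queries, num_images) query-to-image similarity scores.
    gt_indices: (num_queries,) index of the correct gallery image for each query.
    """
    # Indices of the k highest-scoring images per query (order within the top-k is irrelevant).
    topk = np.argpartition(-similarity, kth=k - 1, axis=1)[:, :k]
    hits = (topk == gt_indices[:, None]).any(axis=1)
    return float(hits.mean())
```

The reported 7.61% absolute gain would correspond to this fraction improving by 0.0761 relative to the fine-tuned baselines.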

📝 Abstract
Interactive Text-to-Image Retrieval (I-TIR) has emerged as a transformative user-interactive tool for applications in domains such as e-commerce and education. Yet, current methodologies predominantly depend on finetuned Multimodal Large Language Models (MLLMs), which face two critical limitations: (1) Finetuning imposes prohibitive computational overhead and long-term maintenance costs. (2) Finetuning narrows the pretrained knowledge distribution of MLLMs, reducing their adaptability to novel scenarios. These issues are exacerbated by the inherently dynamic nature of real-world I-TIR systems, where queries and image databases evolve in complexity and diversity, often deviating from static training distributions. To overcome these constraints, we propose Diffusion Augmented Retrieval (DAR), a paradigm-shifting framework that bypasses MLLM finetuning entirely. DAR synergizes Large Language Model (LLM)-guided query refinement with Diffusion Model (DM)-based visual synthesis to create contextually enriched intermediate representations. This dual-modality approach deciphers nuanced user intent more holistically, enabling precise alignment between textual queries and visually relevant images. Rigorous evaluations across four benchmarks reveal DAR's dual strengths: (1) Matches state-of-the-art finetuned I-TIR models on straightforward queries without task-specific training. (2) Scalable Generalization: Surpasses finetuned baselines by 7.61% in Hits@10 (top-10 accuracy) under multi-turn conversational complexity, demonstrating robustness to intricate, distributionally shifted interactions. By eliminating finetuning dependencies and leveraging generative-augmented representations, DAR establishes a new trajectory for efficient, adaptive, and scalable cross-modal retrieval systems.
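
As a rough illustration of the pipeline the abstract describes (not the authors' released code), the sketch below assumes off-the-shelf Hugging Face components: an LLM-style refinement step rewrites the multi-turn dialogue into a single detailed query, a diffusion model synthesizes an intermediate image from that query, and a frozen CLIP encoder fuses the text and synthetic-image embeddings to rank the gallery. The model checkpoints, the fusion weight `alpha`, and the helper `refine_query` are assumptions for illustration.

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen retrieval backbone and an off-the-shelf diffusion model (illustrative choices).
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
sd = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

def refine_query(dialogue: list[str]) -> str:
    # Placeholder for the LLM-driven refinement step: in practice an instruction-tuned
    # LLM would rewrite the dialogue turns into one caption-style query.
    return " ".join(dialogue)

@torch.no_grad()
def retrieve(dialogue, gallery_image_embs, alpha=0.5, top_k=10):
    query = refine_query(dialogue)

    # Text embedding of the refined query.
    txt = clip_proc(text=[query], return_tensors="pt", truncation=True).to(device)
    q_txt = clip.get_text_features(**txt)

    # Diffusion-synthesized image acting as a contextually enriched visual proxy of the query.
    synth = sd(query, num_inference_steps=25).images[0]
    img = clip_proc(images=synth, return_tensors="pt").to(device)
    q_img = clip.get_image_features(**img)

    # Fuse the two modalities into one intermediate representation and rank the gallery.
    q = alpha * torch.nn.functional.normalize(q_txt, dim=-1) \
        + (1 - alpha) * torch.nn.functional.normalize(q_img, dim=-1)
    q = torch.nn.functional.normalize(q, dim=-1)
    scores = q @ gallery_image_embs.T          # gallery embeddings are pre-normalized
    return scores.topk(top_k, dim=-1).indices
```

In a real system the gallery embeddings would be pre-computed once with the same frozen CLIP image encoder, so each interaction round only costs one query-refinement call, one diffusion sample, and a dot-product search.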
Problem

Research questions and friction points this paper is trying to address.

Interactive Image Retrieval
Large Language Models
Generalization Limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Enhanced Retrieval
Text-to-Image Search
Interactive Image Retrieval
Zijun Long
Hunan University, Changsha, Hunan, China
Kangheng Liang
University of Glasgow, Glasgow, United Kingdom
Gerardo Aragon-Camarasa
School of Computing Science, University of Glasgow
Robot Vision, Computer Vision, Robotics and Autonomous Systems, Deformable Objects
R. McCreadie
University of Glasgow, Glasgow, United Kingdom
Paul Henderson
University of Glasgow
computer vision, machine learning