Scaling Prompt Instructed Zero Shot Composed Image Retrieval with Image-Only Data

📅 2025-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high annotation cost and poor scalability of composed image retrieval (CIR), which traditionally relies on manually curated triplets (reference image, reformulation text, target image). The authors propose InstructCIR, a framework that removes the need for manual annotation by synthesizing CIR training data automatically from unannotated image collections, using image captioning models and large language models (LLMs). Its core architectural contribution is an embedding reformulation module that fuses the reference image and the modification text into a single query embedding for cross-modal retrieval. On CIRR and FashionIQ, InstructCIR achieves state-of-the-art zero-shot performance. Moreover, as the volume of generated data increases, its zero-shot results progressively approach those of fully supervised baselines, demonstrating improved training scalability and generalization.

📝 Abstract
Composed Image Retrieval (CIR) is the task of retrieving images matching a reference image augmented with a text, where the text describes changes to the reference image in natural language. Traditionally, models designed for CIR have relied on triplet data containing a reference image, reformulation text, and a target image. However, curating such triplet data often necessitates human intervention, leading to prohibitive costs. This challenge has hindered the scalability of CIR model training even with the availability of abundant unlabeled data. With the recent advances in foundational models, we advocate a shift in the CIR training paradigm where human annotations can be efficiently replaced by large language models (LLMs). Specifically, we demonstrate the capability of large captioning and language models in efficiently generating data for CIR only relying on unannotated image collections. Additionally, we introduce an embedding reformulation architecture that effectively combines image and text modalities. Our model, named InstructCIR, outperforms state-of-the-art methods in zero-shot composed image retrieval on CIRR and FashionIQ datasets. Furthermore, we demonstrate that by increasing the amount of generated data, our zero-shot model gets closer to the performance of supervised baselines.
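The data-generation pipeline described in the abstract (caption an unannotated image, then have an LLM produce a reformulation text and a target description) can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: `caption_image` and `llm_reformulate` are hypothetical stubs standing in for a real captioning model and an instruction-following LLM.

```python
# Sketch of LLM-driven CIR triplet synthesis from an unannotated image
# collection. Both model calls below are hard-coded stubs (assumptions);
# in practice they would query a captioner and an LLM.

def caption_image(image_id: str) -> str:
    """Stub captioner: returns a caption for the image (assumption)."""
    return {"img_001": "a red car parked on a street"}.get(image_id, "an object")

def llm_reformulate(caption: str) -> tuple:
    """Stub LLM: given a source caption, produce a modification
    instruction and the caption of the implied target image (assumption)."""
    if "red car" in caption:
        return ("make the car blue", "a blue car parked on a street")
    return ("change the color", caption)

def synthesize_triplet(image_id: str) -> dict:
    """Build one (reference image, reformulation text, target) training triplet."""
    ref_caption = caption_image(image_id)
    instruction, target_caption = llm_reformulate(ref_caption)
    return {
        "reference": image_id,
        "instruction": instruction,
        # The target caption would later be matched to a real gallery image.
        "target_caption": target_caption,
    }

triplet = synthesize_triplet("img_001")
print(triplet["instruction"])  # -> make the car blue
```

The point of the sketch is only the data flow: every unlabeled image yields a triplet without any human in the loop, which is what lets training scale with the size of the image collection.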
Problem

Research questions and friction points this paper is trying to address.

Replacing human-annotated triplet data with LLM-generated data
Scaling CIR training using unannotated image collections
Improving zero-shot composed image retrieval performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLMs to replace human annotations
Generates CIR data from unannotated images
Introduces embedding reformulation architecture
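To make the retrieval side concrete, here is a minimal sketch of what combining the image and text modalities into one query embedding looks like, assuming a simple weighted-sum fusion baseline. The paper's embedding reformulation architecture is a learned module; the `fuse` function below is only an illustrative stand-in.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def fuse(img_emb, txt_emb, alpha=0.5):
    # Late-fusion baseline (assumption): normalized convex combination
    # of the reference-image and modification-text embeddings.
    fused = [alpha * i + (1 - alpha) * t for i, t in zip(img_emb, txt_emb)]
    return normalize(fused)

def retrieve(query, gallery):
    # Rank gallery images by cosine similarity (dot product of unit vectors)
    # and return the best match.
    scores = {name: sum(q * g for q, g in zip(query, normalize(vec)))
              for name, vec in gallery.items()}
    return max(scores, key=scores.get)

# Toy 3-d embeddings standing in for real vision/text encoder outputs.
ref_img = normalize([1.0, 0.0, 0.2])
mod_txt = normalize([0.0, 1.0, 0.1])
query = fuse(ref_img, mod_txt)
gallery = {"target": [0.6, 0.7, 0.2], "distractor": [1.0, 0.0, 0.0]}
print(retrieve(query, gallery))  # -> target
```

The fused query sits between the reference image and the text in embedding space, so the gallery image that reflects both (the "target") outranks one that matches only the reference image.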