From Mapping to Composing: A Two-Stage Framework for Zero-shot Composed Image Retrieval

📅 2025-04-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Zero-shot composed image retrieval (ZS-CIR) faces several challenges: weak semantic representation of pseudo-word tokens, inconsistency between training and inference, and excessive reliance on large-scale synthetic data. This paper proposes a novel two-stage decoupled framework, "From Mapping to Composing": (1) a vision-semantic injection stage that strengthens fine-grained semantic alignment from images to pseudo-word tokens; and (2) a lightweight fine-tuning stage for the text encoder that achieves soft alignment and fusion between pseudo-word tokens and modification texts. The method eliminates the need for large-scale synthetic data and generalizes effectively across both high- and low-quality triplets, achieving significant improvements over state-of-the-art methods on three standard benchmarks while using only a small number of synthetic samples. Key contributions include the first decoupled two-stage training paradigm for ZS-CIR, the vision-semantic injection mechanism, and a novel soft text-alignment objective.

📝 Abstract
Composed Image Retrieval (CIR) is a challenging multimodal task that retrieves a target image based on a reference image and accompanying modification text. Due to the high cost of annotating CIR triplet datasets, zero-shot (ZS) CIR has gained traction as a promising alternative. Existing studies mainly focus on projection-based methods, which map an image to a single pseudo-word token. However, these methods face three critical challenges: (1) insufficient pseudo-word token representation capacity, (2) discrepancies between training and inference phases, and (3) reliance on large-scale synthetic data. To address these issues, we propose a two-stage framework in which training proceeds from mapping to composing. In the first stage, we enhance image-to-pseudo-word token learning by introducing a visual semantic injection module and a soft text alignment objective, enabling the token to capture richer, fine-grained image information. In the second stage, we optimize the text encoder using a small amount of synthetic triplet data, enabling it to extract compositional semantics by combining pseudo-word tokens with modification text for accurate target image retrieval. The strong visual-to-pseudo mapping established in the first stage provides a solid foundation for the second stage, making our approach compatible with both high- and low-quality synthetic data and capable of achieving significant performance gains with only a small amount of synthetic data. Extensive experiments on three public datasets show that our approach achieves superior performance compared to existing methods.
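The projection-based pipeline the abstract describes can be sketched in a few lines: stage 1 learns a mapping from an image embedding to a pseudo-word token, and stage 2 composes that token with the modification text before ranking a gallery by cosine similarity. The sketch below is illustrative only; the projection `W_phi` and the averaging stand-in for the text encoder are assumptions, not the paper's actual modules.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # assumed shared CLIP-style embedding dimension

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Stage 1 (mapping): a learned projection turns an image embedding into a
# pseudo-word token living in the text-token space. Random weights here
# stand in for the trained vision-semantic injection module.
W_phi = rng.normal(scale=0.02, size=(DIM, DIM))

def image_to_pseudo_token(image_emb):
    return image_emb @ W_phi

# Stage 2 (composing): the fine-tuned text encoder would consume a prompt
# like "a photo of [pseudo] that <modification>"; a simple average is used
# here purely as a placeholder for that encoder.
def compose(pseudo_token, mod_text_emb):
    return l2_normalize((pseudo_token + mod_text_emb) / 2.0)

# Retrieval: rank gallery images by cosine similarity to the composed query.
def retrieve(query_emb, gallery_embs):
    sims = l2_normalize(gallery_embs) @ query_emb
    return np.argsort(-sims)

ref_image = rng.normal(size=DIM)   # reference image embedding (toy data)
mod_text = rng.normal(size=DIM)    # modification text embedding (toy data)
gallery = rng.normal(size=(10, DIM))

query = compose(image_to_pseudo_token(ref_image), mod_text)
ranking = retrieve(query, gallery)
print(ranking[:3])  # indices of the top-3 candidate images
```

Because both stages are decoupled, the projection can be trained on plain image-caption pairs, and only the lightweight compose step needs triplet supervision.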
Problem

Research questions and friction points this paper is trying to address.

Enhancing pseudo-word token representation for image retrieval
Reducing training-inference discrepancy in zero-shot CIR
Minimizing reliance on large-scale synthetic triplet data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage framework enhances zero-shot CIR
Visual semantic injection improves pseudo-word tokens
Soft text alignment optimizes compositional semantics
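The page does not spell out the soft text-alignment objective, but one common instantiation is a cross-entropy against softened targets: instead of matching each pseudo-word token to a single caption one-hot, it is matched against a temperature-softened distribution over all captions in the batch. The teacher distribution from text-text self-similarity below is an assumption for illustration; the paper's exact loss may differ.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_alignment_loss(pseudo_tokens, text_embs, tau=0.07):
    """Cross-entropy between the pseudo-token-to-caption similarity
    distribution and a soft target distribution, rather than a hard
    one-hot match. (Illustrative stand-in for the paper's objective.)"""
    sims = pseudo_tokens @ text_embs.T                  # (B, B) similarities
    p = softmax(sims / tau)                             # model distribution
    # soft targets from text-text self-similarity (assumed teacher signal)
    q = softmax((text_embs @ text_embs.T) / tau)
    return -(q * np.log(p + 1e-9)).sum(axis=-1).mean()

rng = np.random.default_rng(1)
B, D = 8, 32
pseudo = rng.normal(size=(B, D))
pseudo /= np.linalg.norm(pseudo, axis=1, keepdims=True)
texts = rng.normal(size=(B, D))
texts /= np.linalg.norm(texts, axis=1, keepdims=True)
loss = soft_alignment_loss(pseudo, texts)
print(round(float(loss), 4))
```

By Gibbs' inequality the loss is minimized when the pseudo-token similarities reproduce the target distribution, so perfectly aligned tokens score no worse than random ones.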
👥 Authors

Yabing Wang · Xi'an Jiaotong University · multimodal learning
Zhuotao Tian · Professor, Harbin Institute of Technology (Shenzhen) · Vision-language Models, Multi-modal Perception, Computer Vision
Qingpei Guo · Ant Group · Multimodal LLMs, Vision-Language Models
Zheng Qin · Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
Sanping Zhou · Xi'an Jiaotong University · Computer Vision, Machine Learning
Ming Yang · Ant Group, Hangzhou, 310000, China
Le Wang · Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China