Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets

Date: 2026-06-05
AI Summary
This work addresses two limitations of existing Zero-Shot Composed Image Retrieval (ZS-CIR) benchmarks: they rely on public images that models may have seen during pretraining, so the setting is not truly zero-shot, and their reference and target images are often only weakly related. To address both, the authors introduce ZeroSight, a benchmark built from videos published after March 31, 2022. Reference-target pairs are drawn from frames of a single video to keep them visually and semantically consistent, and relative captions are generated with LLM assistance. The authors also propose SC4CIR, a training-free, MLLM-driven method that identifies hard negative targets through three symmetric consistency checks and works as a plug-and-play addition to existing CIR methods. Experiments across 27 methods indicate that current datasets and evaluation metrics inflate retrieval performance, and that SC4CIR significantly improves performance when integrated with various CIR methods.
Abstract
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption without training samples. Existing ZS-CIR datasets often suffer from complete irrelevance between reference and target images due to noisy image sources, and do not achieve a true zero-shot scenario as they use public image datasets that models like CLIP have been trained on. To tackle these challenges, we introduce ZeroSight, a novel benchmark for ZS-CIR. It includes a dataset with consistent reference-target pairs sourced from videos, a data construction pipeline, and evaluation methods that consider the ranking of multiple positive and negative target images. We ensure visually and semantically consistent reference-target pairs by extracting frames from a single video and generating relative captions using LLM-assisted methods. To ensure a true zero-shot scenario, we use video data published after March 31, 2022, ensuring it was not included in CLIP's pre-training data. Additionally, we propose a training-free MLLM-driven method, SC4CIR (Symmetric Consistency for CIR), which can effectively identify hard negative targets through 3 symmetric consistency checks. This method is plug-and-play, seamlessly integrating with various CIR methods and significantly improving performance. Our experimental results from 27 methods reveal that current ZS-CIR datasets and evaluation metrics result in inflated retrieval performance, exaggerating the capabilities of CIR methods. Our benchmark and models can be accessed at https://github.com/sotayang/ZeroSight.
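Two ideas in the abstract can be illustrated with a minimal Python sketch: the publication-date cutoff, and an evaluation that accounts for several positive targets ranked among negatives. Only the cutoff date (March 31, 2022) comes from the abstract. The function names, the (video_id, publication_date) input format, and the choice of average precision as the metric are assumptions made for illustration, and the paper's own pipeline and metrics may differ.

```python
from datetime import date

# Cutoff stated in the abstract: only videos published after this date are kept.
CUTOFF = date(2022, 3, 31)


def published_after_cutoff(videos, cutoff=CUTOFF):
    """Keep videos whose publication date is strictly after the cutoff.

    `videos` is a hypothetical list of (video_id, publication_date) pairs.
    """
    return [(vid, d) for vid, d in videos if d > cutoff]


def average_precision(ranked_ids, positive_ids):
    """Average precision of one ranked list with several positive targets.

    An illustrative stand-in metric (an assumption), not necessarily the
    one used by ZeroSight.
    """
    positives = set(positive_ids)
    if not positives:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in positives:
            hits += 1
            # Precision at this rank, counted only at positive positions.
            precision_sum += hits / rank
    return precision_sum / len(positives)


# Hypothetical data: one video on the cutoff date, one after, one before.
videos = [
    ("a", date(2022, 3, 31)),
    ("b", date(2022, 4, 1)),
    ("c", date(2021, 12, 1)),
]
kept = published_after_cutoff(videos)

# A negative ranked above the second positive lowers the score below 1.0.
ap_mixed = average_precision(["p1", "n1", "p2", "n2"], ["p1", "p2"])
ap_perfect = average_precision(["p1", "p2", "n1", "n2"], ["p1", "p2"])
```

In this sketch a video published exactly on the cutoff date is excluded, which matches the "after March 31, 2022" wording. A metric of this kind penalizes a method whenever a hard negative outranks a positive target, which a single-target recall measure would not capture.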
Problem

Research questions and friction points this paper is trying to address.

Zero-Shot Composed Image Retrieval
genuine zero-shot
consistent reference-target pairs
video-sourced datasets
CLIP pre-training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-Shot Composed Image Retrieval
Video-Sourced Dataset
True Zero-Shot Benchmark
Symmetric Consistency
Training-Free MLLM