Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets

📅 2026-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of existing zero-shot composed image retrieval (ZS-CIR) benchmarks: they draw on public images that models such as CLIP may have seen during pretraining, so the setting is not truly zero-shot, and their reference and target images are often only weakly related. To overcome these issues, the authors introduce ZeroSight, a benchmark built from videos published after March 31, 2022. Reference and target frames are extracted from the same video to keep each pair visually and semantically consistent, and relative captions are generated with LLM assistance. The authors also propose SC4CIR (Symmetric Consistency for CIR), a training-free, MLLM-driven method that uses three symmetric consistency checks to identify hard negative targets and plugs into existing CIR methods. Experiments with 27 methods indicate that current datasets and evaluation metrics inflate reported retrieval performance, and that SC4CIR significantly improves performance when integrated with various CIR methods.
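The release-date cutoff is the mechanism that makes the benchmark zero-shot with respect to CLIP. A minimal sketch of such a filter follows, using the March 31, 2022 date given in the abstract. The `VideoMeta` record, its field names, and the `filter_videos` helper are hypothetical and are not taken from the ZeroSight pipeline.

```python
from dataclasses import dataclass
from datetime import date

# Cutoff stated in the abstract: only videos published after March 31, 2022.
CUTOFF = date(2022, 3, 31)


@dataclass(frozen=True)
class VideoMeta:
    # Hypothetical metadata record; field names are illustrative only.
    video_id: str
    published: date


def is_after_cutoff(video: VideoMeta, cutoff: date = CUTOFF) -> bool:
    """Return True only for videos published strictly after the cutoff."""
    return video.published > cutoff


def filter_videos(videos, cutoff: date = CUTOFF):
    """Keep the videos that are eligible as frame sources."""
    return [v for v in videos if is_after_cutoff(v, cutoff)]
```

The comparison is strict, so a video published on the cutoff date itself is excluded, which matches the wording "after March 31, 2022".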

📝 Abstract

Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption without training samples. Existing ZS-CIR datasets often suffer from complete irrelevance between reference and target images due to noisy image sources, and do not achieve a true zero-shot scenario as they use public image datasets that models like CLIP have been trained on. To tackle these challenges, we introduce ZeroSight, a novel benchmark for ZS-CIR. It includes a dataset with consistent reference-target pairs sourced from videos, a data construction pipeline, and evaluation methods that consider the ranking of multiple positive and negative target images. We ensure visually and semantically consistent reference-target pairs by extracting frames from a single video and generating relative captions using LLM-assisted methods. To ensure a true zero-shot scenario, we use video data published after March 31, 2022, so that it was not included in CLIP's pre-training data. Additionally, we propose a training-free MLLM-driven method, SC4CIR (Symmetric Consistency for CIR), which can effectively identify hard negative targets through three symmetric consistency checks. This method is plug-and-play, seamlessly integrating with various CIR methods and significantly improving performance. Our experimental results from 27 methods reveal that current ZS-CIR datasets and evaluation metrics result in inflated retrieval performance, exaggerating the capabilities of CIR methods. Our benchmark and models can be accessed at https://github.com/sotayang/ZeroSight.
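The abstract mentions evaluation that considers the ranking of multiple positive and negative targets, but does not name the metric. As an illustration only, the sketch below uses average precision, a standard ranking measure for queries with several correct answers. It is a stand-in chosen for this example, not necessarily the metric ZeroSight defines, and the function names and item identifiers are hypothetical.

```python
def average_precision(ranked_ids, positive_ids):
    """Average precision of one ranked list with possibly many positives.

    ranked_ids: candidate ids ordered from most to least similar.
    positive_ids: ids of all correct targets for the query.
    """
    positives = set(positive_ids)
    if not positives:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in positives:
            hits += 1
            # Precision at this rank, counted only where a positive appears.
            precision_sum += hits / rank
    return precision_sum / len(positives)


def recall_at_k(ranked_ids, positive_ids, k):
    """Fraction of all positives that appear in the top-k results."""
    positives = set(positive_ids)
    if not positives:
        return 0.0
    return len(positives & set(ranked_ids[:k])) / len(positives)
```

With two positives, a ranking that places them at positions 1 and 3 scores (1/1 + 2/3) / 2 ≈ 0.83, while one that places them at positions 3 and 4, behind two negatives, scores (1/3 + 2/4) / 2 ≈ 0.42. A measure like this penalizes negatives that outrank positives, which a single-target Recall@K does not capture.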

Problem

Research questions and friction points this paper is trying to address.

Zero-Shot Composed Image Retrieval

genuine zero-shot

consistent reference-target pairs

video-sourced datasets

CLIP pre-training data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-Shot Composed Image Retrieval

Video-Sourced Dataset

True Zero-Shot Benchmark