Reason, Retrieve, Re-rank: A Zero-Shot Reasoning-Aware Framework for Composed Video Retrieval

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses zero-shot composed video retrieval: given a reference video and a free-form textual editing instruction, retrieve the target video that matches the edited result. It proposes R3-CoVR, a training-free framework that chains three stages: multimodal large language model reasoning, contrastive retrieval, and constraint-aware reranking. Specifically, Qwen3-VL-8B first generates a description of the post-edit video, SigLIP-2 then retrieves an initial shortlist of candidates, and a constraint-aware reranking step finally reorders that shortlist to improve precision. Experiments show that both matching the description length to the contrastive encoder's text window and the reranking mechanism substantially improve performance, with reranking alone raising R@1 by 19.2 percentage points (from 72.7% to 91.9%). The method achieves 91.9% R@1 and 98.2% R@10 on the CVPR 2026 VidLLMs Challenge test set, significantly outperforming existing baselines.
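To make the reported metrics concrete, here is a minimal sketch of how Recall@K is conventionally computed, together with the arithmetic behind the reranking gain. The function name `recall_at_k` and the toy rankings are illustrative assumptions, not taken from the paper's code; only the figures 72.7 and 91.9 come from the source.

```python
def recall_at_k(rankings, targets, k):
    """Fraction of queries whose target video appears in the top-k results.

    rankings: one ranked list of video ids per query (best first).
    targets:  the ground-truth video id for each query.
    """
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return hits / len(targets)


# Toy data (hypothetical): four queries over a three-video gallery.
rankings = [
    ["v3", "v1", "v2"],
    ["v2", "v3", "v1"],
    ["v1", "v2", "v3"],
    ["v2", "v1", "v3"],
]
targets = ["v3", "v3", "v2", "v3"]

r_at_1 = recall_at_k(rankings, targets, 1)
r_at_2 = recall_at_k(rankings, targets, 2)

# Reranking gain reported in the source: R@1 rises from 72.7 to 91.9.
before, after = 72.7, 91.9
absolute_gain_points = round(after - before, 1)                   # percentage points
relative_gain_percent = round(100 * (after - before) / before, 1) # relative to 72.7

print(r_at_1, r_at_2, absolute_gain_points, relative_gain_percent)
```

The distinction matters when reading the result: 19.2 is an absolute difference in percentage points, which corresponds to a relative improvement of roughly 26% over the pre-reranking R@1.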

📝 Abstract

Composed Video Retrieval (CoVR) seeks the target video that results from applying a free-form textual modification to a reference video. We address the Reason-Aware CoVR (CoVR-R) challenge at the CVPR 2026 VidLLMs workshop, where retrieval is strictly zero-shot. We present R3-CoVR (Reason, Retrieve, Re-rank), a training-free pipeline built entirely from frozen foundation models. A multimodal large language model (Qwen3-VL-8B) reasons about the after-effects an edit implies (state transitions, action phases, scene, camera and tempo) and verbalises a concise post-edit description; a contrastive video-text encoder (SigLIP-2) embeds this description and the gallery for first-stage retrieval; finally, a constraint-aware re-ranking stage uses the same multimodal model as a judge that scores each shortlisted candidate against the intended edited result. On the challenge test set, R3-CoVR attains 91.9% R@1 and 98.2% R@10. Two findings drive these results: (i) matching the description length to the contrastive encoder's text window lifts R@1 from 67.5 to 72.7; and (ii) the constraint-aware re-ranker, which reorders only the shortlist, lifts R@1 from 72.7 to 91.9, the single largest gain. We analyse the re-ranker's behaviour, the retrieve/re-rank blend, and the shortlist depth, and we release a clean three-layer implementation.
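The retrieve-then-re-rank flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the embeddings are toy 2-D vectors standing in for SigLIP-2 outputs, `toy_judge` stands in for the Qwen3-VL-8B judge scores, and the linear blend with weight `alpha` is one plausible reading of the "retrieve/re-rank blend", which the abstract does not specify. `top_k` plays the role of the shortlist depth.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_emb, gallery, top_k):
    """First stage: rank every gallery video by similarity, keep the top_k."""
    scored = [(vid, cosine(query_emb, emb)) for vid, emb in gallery.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def rerank(shortlist, judge_score, alpha):
    """Second stage: reorder only the shortlist using a blended score.

    alpha (hypothetical) weights the judge score against the retrieval score.
    """
    blended = [
        (vid, (1 - alpha) * sim + alpha * judge_score(vid))
        for vid, sim in shortlist
    ]
    blended.sort(key=lambda pair: pair[1], reverse=True)
    return [vid for vid, _ in blended]


# Toy gallery embeddings (assumed); in the real pipeline these would come
# from the contrastive encoder applied to gallery videos.
gallery = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0], "d": [-1.0, 0.0]}

# Toy embedding of the generated post-edit description (assumed).
query = [1.0, 0.2]

# Toy judge scores in [0, 1] for shortlisted candidates (assumed).
toy_judge = {"a": 0.2, "b": 0.9, "c": 0.1}

shortlist = retrieve(query, gallery, top_k=3)
final = rerank(shortlist, lambda vid: toy_judge[vid], alpha=0.5)
print([vid for vid, _ in shortlist], final)
```

In this toy run the first stage ranks `a` ahead of `b`, and the judge promotes `b` to the top. Because the re-ranker only reorders the shortlist, a target that the first stage misses cannot be recovered, which is why shortlist depth is worth analysing.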

Problem

Research questions and friction points this paper is trying to address.

Composed Video Retrieval

Zero-Shot

Reason-Aware

Video Retrieval

Multimodal Reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-Shot Reasoning

Composed Video Retrieval

Multimodal Large Language Model