R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking

📅 2026-05-31
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses composed video retrieval: retrieving a target video from a large gallery given a reference video and a textual edit instruction, which requires modeling both the visual content and the edit semantics. The authors propose R³, a zero-shot pipeline that first generates a reasoning trace describing the expected target video, then encodes that trace together with the source video as a reasoning-augmented query for scalable recall. The reasoning-augmented retrieval score is fused with the base composed-query score through an agreement-gated residual rule, and a reranker then verifies the recalled candidates by direct source-candidate comparison. The authors report that the method is effective on the CoVR-R challenge.
📝 Abstract
The CoVR-R challenge evaluates composed video retrieval, where a system must retrieve a target video from a large gallery given a reference video and a textual edit instruction. This setting is not a standard video-text retrieval problem: the query is defined by both the visual evidence in the source video and the transformation implied by the edit. A strong embedding model can provide scalable candidate recall, but it may under-express target-side consequences such as state changes, action replacement, object preservation, or temporal consistency. A pairwise multimodal reranker can verify such details more directly, but exhaustive reranking over the full gallery is computationally infeasible. We present $\mathbb{R}^3$, a zero-shot composed video retrieval pipeline built around Reasoning-guided Recalling and Reranking. The core idea is to turn the source-edit query into a reasoning-grounded retrieval program rather than treating the edit text as a short caption. First, the model generates a reasoning trace that describes the expected target video after applying the edit. Then, the trace is encoded together with the source video as a reasoning-augmented query, and its retrieval score is fused with that of the base composed query through an agreement-gated residual rule. Finally, a reranker verifies the recalled candidates by direct source-candidate comparison. Experiments demonstrate the effectiveness of our method on this challenge. Code is available at https://github.com/Lee-zixu/R-3.
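The three stages described in the abstract (embedding recall, agreement-gated residual fusion, reranking of the recalled candidates) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the gating rule, so the gate here is assumed to be the overlap between the two top-k lists, and `alpha`, `k`, and the `pair_score` callable are hypothetical stand-ins for the paper's actual components.

```python
import numpy as np


def cosine_scores(query, gallery):
    """Cosine similarity between one query vector and every gallery row."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return g @ q


def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return np.argsort(-scores, kind="stable")[:k]


def agreement_gated_fusion(base, reasoning, k=3, alpha=0.5):
    """Add the reasoning score to the base score as a gated residual.

    ASSUMPTION: the gate is the fraction of candidates shared by the two
    top-k lists, so the reasoning score only contributes when both
    queries already agree on some candidates.
    """
    shared = set(top_k(base, k).tolist()) & set(top_k(reasoning, k).tolist())
    gate = len(shared) / k
    return base + alpha * gate * reasoning


def rerank(candidates, pair_score):
    """Reorder recalled candidates by a pairwise source-candidate score.

    `pair_score` is a hypothetical callable mapping a gallery index to a
    verification score from a multimodal reranker.
    """
    return sorted(candidates, key=pair_score, reverse=True)


def retrieve(base_q, reasoning_q, gallery, pair_score, k=3, alpha=0.5):
    """Recall top-k candidates with fused scores, then rerank only those."""
    base = cosine_scores(base_q, gallery)
    reasoning = cosine_scores(reasoning_q, gallery)
    fused = agreement_gated_fusion(base, reasoning, k=k, alpha=alpha)
    candidates = top_k(fused, k).tolist()
    return rerank(candidates, pair_score)
```

Because the reranker only sees the `k` recalled candidates, the expensive pairwise comparison is run `k` times per query rather than once per gallery item, which is the efficiency argument the abstract makes against exhaustive reranking.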
Problem

Research questions and friction points this paper is trying to address.

composed video retrieval
video-text retrieval
edit instruction
multimodal reranking
reasoning-guided retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

composed video retrieval
reasoning-guided retrieval
zero-shot learning
multimodal reranking
video-text reasoning