R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses composed video retrieval: retrieving a target video from a large gallery given a reference video and a textual edit instruction, which requires jointly modeling the source video's visual content and the semantics of the edit. The authors propose R³, a zero-shot pipeline that first generates a reasoning trace describing the expected target video, then encodes that trace together with the source video as a reasoning-augmented query for scalable candidate recall. The retrieval score of this query is fused with the score of the base composed query through an agreement-gated residual rule, and a reranker then verifies the recalled candidates by direct source-candidate comparison. The authors report that the method is effective on the CoVR-R challenge.

📝 Abstract

The CoVR-R challenge evaluates composed video retrieval, where a system must retrieve a target video from a large gallery given a reference video and a textual edit instruction. This setting is not a standard video-text retrieval problem: the query is defined by both the visual evidence in the source video and the transformation implied by the edit. A strong embedding model can provide scalable candidate recall, but it may under-express target-side consequences such as state changes, action replacement, object preservation, or temporal consistency. A pairwise multimodal reranker can verify such details more directly, but exhaustive reranking over the full gallery is computationally infeasible. We present $\mathbb{R}^3$, a zero-shot composed video retrieval pipeline built around Reasoning-guided Recalling and Reranking. The core idea is to turn the source-edit query into a reasoning-grounded retrieval program rather than treating the edit text as a short caption. First, the model generates a reasoning trace that describes the expected target video after applying the edit. Then, the trace is encoded together with the source video as a reasoning-augmented query, and its retrieval score is fused with that of the base composed query through an agreement-gated residual rule. Finally, a reranker verifies the recalled candidates through direct source-candidate comparison. Experiments demonstrate the effectiveness of our method on this challenge. Code is available at https://github.com/Lee-zixu/R-3.
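
To make the recall-then-rerank flow in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not define the agreement-gated residual rule, so the gate used here (add a weighted reasoning score only to candidates that both the base query and the reasoning-augmented query place in their top `agree_k`) is an assumption, as are all function names, the `alpha` weight, and the `pair_score` callback that stands in for the multimodal reranker.

```python
from typing import Callable, List, Sequence


def top_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return indices of the k highest scores, best first (ties: lower index first)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]


def gated_residual_fusion(
    base: Sequence[float],
    reasoning: Sequence[float],
    agree_k: int = 3,
    alpha: float = 0.5,
) -> List[float]:
    """Fuse base and reasoning-augmented scores (hypothetical gating rule).

    ASSUMPTION: a candidate receives the residual term alpha * reasoning[i]
    only if it is in the top agree_k of BOTH score lists; otherwise its base
    score is kept unchanged.
    """
    agreed = set(top_indices(base, agree_k)) & set(top_indices(reasoning, agree_k))
    return [
        b + alpha * r if i in agreed else b
        for i, (b, r) in enumerate(zip(base, reasoning))
    ]


def recall_then_rerank(
    base: Sequence[float],
    reasoning: Sequence[float],
    pair_score: Callable[[int], float],
    recall_k: int = 3,
    agree_k: int = 3,
    alpha: float = 0.5,
) -> List[int]:
    """Recall a shortlist with fused scores, then reorder it with a pairwise scorer.

    pair_score(i) is a hypothetical stand-in for a reranker that compares the
    source video directly against gallery candidate i. It is called only on
    the shortlist, never on the full gallery.
    """
    fused = gated_residual_fusion(base, reasoning, agree_k, alpha)
    shortlist = top_indices(fused, recall_k)
    # sorted() is stable, so candidates with equal pair scores keep recall order.
    return sorted(shortlist, key=lambda i: -pair_score(i))
```

Calling `recall_then_rerank(base, reasoning, pair_score, recall_k=K)` invokes the expensive pairwise scorer on only `K` candidates, which reflects the abstract's point that exhaustive reranking over the full gallery is infeasible.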

Problem

Research questions and friction points this paper is trying to address.

composed video retrieval

video-text retrieval

edit instruction

multimodal reranking

reasoning-guided retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

composed video retrieval

reasoning-guided retrieval

zero-shot learning