Training-Free Composed Video Retrieval via Visual Representation-Guided Video-LLM Reasoning

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses composed video retrieval, in which a target video is retrieved given a reference video and a textual modification instruction, and proposes a training-free method for it. The approach first uses frozen DINOv3 models to select a compact set of visually similar candidate videos, then applies a Video Large Language Model (Video-LLM) to assess whether each candidate satisfies the instruction. A reasoning-based refinement step then re-examines the top candidates to improve the first-ranked prediction. The framework combines visual representations with language-driven reasoning without any fine-tuning, and attains 48.78 Recall@1 and 51.48 Recall@5 on the test set of the CVPR 2026 Reason-Aware Composed Video Retrieval Challenge.
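The Recall@1 and Recall@5 figures can be read with the following minimal sketch of the metric. It assumes the common convention of one ground-truth target per query and a score reported as a percentage; the challenge's official evaluation script is not reproduced here, and the names `recall_at_k`, `rankings`, and `targets` are illustrative.

```python
from typing import Dict, List


def recall_at_k(rankings: Dict[str, List[str]], targets: Dict[str, str], k: int) -> float:
    """Percentage of queries whose target appears in the top-k ranked results.

    Assumption: exactly one ground-truth target video per query.
    """
    hits = sum(1 for qid, target in targets.items() if target in rankings[qid][:k])
    return 100.0 * hits / len(targets)


# Toy data (hypothetical): two queries, both with target "a".
rankings = {"q1": ["a", "b", "c"], "q2": ["b", "a", "c"]}
targets = {"q1": "a", "q2": "a"}

r1 = recall_at_k(rankings, targets, k=1)  # only q1 ranks "a" first
r2 = recall_at_k(rankings, targets, k=2)  # both queries have "a" in the top 2
```

Under this convention, Recall@1 depends only on the first-ranked prediction, which is the position the refinement step targets.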

📝 Abstract

Recent advances in large vision-language models have expanded video retrieval from simple text-based search to more flexible scenarios, where users may specify the desired result through both visual examples and textual instructions. In the CVPR 2026 Reason-Aware Composed Video Retrieval Challenge, the system is required to retrieve a target video according to a reference video and a modification instruction. To address this task, we develop Visual Representation-Guided Video-LLM Reasoning for Training-Free Composed Video Retrieval. Our framework first uses frozen DINOv3 models to obtain a compact set of visually relevant candidates, and then applies large vision-language models to evaluate whether each candidate satisfies the modification instruction. A final reasoning-based refinement is performed on the top candidates to improve the first-ranked prediction. Without training, our system achieves 48.78 Recall@1 and 51.48 Recall@5 on the test set. Future work may further improve retrieval accuracy through stronger video-LLMs and tighter integration between visual representations and language reasoning.
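The first two stages described above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: video embeddings are assumed to be precomputed (for example by a frozen visual encoder such as DINOv3), cosine similarity is assumed as the candidate-selection measure, and `score_fn` is a hypothetical stand-in for a Video-LLM that rates how well a candidate satisfies the modification instruction. The final reasoning-based refinement is not modeled.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_candidates(ref_emb: Vector, gallery: Dict[str, Vector], k: int) -> List[str]:
    """Stage 1: keep the k gallery videos most visually similar to the reference."""
    ranked = sorted(gallery, key=lambda vid: cosine(ref_emb, gallery[vid]), reverse=True)
    return ranked[:k]


def rerank(candidates: List[str], instruction: str,
           score_fn: Callable[[str, str], float]) -> List[str]:
    """Stage 2: order candidates by how well they satisfy the instruction.

    score_fn is a placeholder for a Video-LLM judgment (hypothetical interface).
    """
    return sorted(candidates, key=lambda vid: score_fn(vid, instruction), reverse=True)


# Toy gallery with 2-D embeddings (hypothetical values).
gallery = {"vid_a": [1.0, 0.0], "vid_b": [0.9, 0.1], "vid_c": [0.0, 1.0]}
# Toy instruction-alignment scores standing in for Video-LLM outputs.
toy_scores = {"vid_a": 0.2, "vid_b": 0.9, "vid_c": 0.5}

candidates = select_candidates([1.0, 0.0], gallery, k=2)
ranking = rerank(candidates, "make it night-time", lambda vid, _text: toy_scores[vid])
```

In this toy run, `vid_c` is filtered out in stage 1 despite its moderate instruction score, which illustrates why the size of the candidate set matters: a target that is visually dissimilar to the reference cannot be recovered by the later stages.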

Problem

Research questions and friction points this paper is trying to address.

Composed Video Retrieval

Video-LLM

Visual Representation

Training-Free

Modification Instruction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-Free

Composed Video Retrieval

Visual Representation-Guided Reasoning