AI Summary
This work addresses incomplete candidate recall and unstable ranking in composed video retrieval with a dual-route architecture that decouples recall from reranking. Parallel text and visual routes first produce a top-10 candidate set; a vision-language model (VLM) then reranks conservatively through pairwise (1v1) comparisons between the current top-1 and each challenger. Rather than relying on direct multi-candidate VLM classification or broad text reranking, the system combines a VLM slot selector, DFN-H/DFN-L contact-sheet embeddings, and a merge of the two routes. On the hidden test split, it reaches 95.28 R@1, 97.47 R@5, 98.48 R@10, and 99.66 R@50.
Abstract
We describe \emph{Dual-Route Top-K Retrieval with 1v1 VLM Reranking} for the CoVR-R challenge. The method treats composed video retrieval as two coupled problems: finding a sufficiently complete top-k candidate set, and then safely deciding whether any candidate should replace a strong current top-1. We first improve the reasoning/text seed with a VLM slot selector over existing candidates, without introducing DFN visual retrieval. We then add a visual route from contact-sheet embeddings using DFN-H/DFN-L. The routes are merged into a top-10 candidate set, after which a VLM final reranker performs conservative 1v1 comparisons between the current top-1 and each challenger. On the hidden test split, the final system reaches 95.28 R@1, 97.47 R@5, 98.48 R@10, and 99.66 R@50. The main lesson is that CoVR-R benefits more from recall-selection decoupling than from broad text reranking or direct multi-candidate VLM classification.
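The two stages described above, merging the routes into a top-10 set and then letting a challenger replace the top-1 only on a clear win, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the round-robin merge order, the `prefer_challenger` callback standing in for the VLM comparison, the `margin` threshold used to express "conservative", and the choice to compare every challenger against the same incumbent.

```python
from typing import Callable, List, Sequence


def merge_routes(text_route: Sequence[str],
                 visual_route: Sequence[str],
                 k: int = 10) -> List[str]:
    """Interleave two ranked lists (text route first), drop duplicates, keep k.

    Round-robin interleaving is an assumed merge rule, not the paper's.
    """
    merged: List[str] = []
    seen = set()
    for i in range(max(len(text_route), len(visual_route))):
        for route in (text_route, visual_route):
            if i < len(route) and route[i] not in seen:
                seen.add(route[i])
                merged.append(route[i])
    return merged[:k]


def conservative_rerank(candidates: Sequence[str],
                        prefer_challenger: Callable[[str, str], float],
                        margin: float = 0.8) -> List[str]:
    """Promote a challenger to rank 1 only if it clearly beats the incumbent.

    prefer_challenger(incumbent, challenger) is a hypothetical stand-in for a
    1v1 VLM judgment; it returns a confidence in [0, 1] that the challenger
    is the better match. If no confidence exceeds `margin`, the input order
    is returned unchanged.
    """
    if not candidates:
        return []
    incumbent = candidates[0]
    best, best_conf = None, margin
    for challenger in candidates[1:]:
        conf = prefer_challenger(incumbent, challenger)
        if conf > best_conf:
            best, best_conf = challenger, conf
    if best is None:
        return list(candidates)
    # Move the winning challenger to the front; keep the rest in order.
    return [best] + [c for c in candidates if c != best]


# Toy usage with made-up video ids and confidences.
text_route = ["a", "b", "c"]
visual_route = ["c", "d", "a"]
top_k = merge_routes(text_route, visual_route, k=10)

toy_conf = {"c": 0.60, "b": 0.95, "d": 0.85}
reranked = conservative_rerank(top_k, lambda inc, ch: toy_conf[ch])
```

Because only the first position can change, a weak or noisy comparison leaves the recall-stage ranking intact, which is one plausible reading of why a conservative 1v1 scheme is more stable than scoring all candidates at once.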