🤖 AI Summary
In partially relevant video retrieval (PRVR), ambiguous queries and only locally relevant videos often lead to spurious semantic alignments. To address this, we propose a probabilistic cross-modal alignment framework. Our method models both queries and video segments as multivariate Gaussian distributions to explicitly capture semantic uncertainty. We further introduce a learnable confidence gating mechanism that dynamically weights token-level similarities and supports proxy-level matching, thereby enhancing fine-grained cross-modal correspondence modeling. The framework is plug-and-play and compatible with mainstream retrieval architectures. Extensive experiments across multiple benchmark datasets and diverse backbone networks demonstrate significant improvements in retrieval accuracy and robustness. Notably, our approach generalizes well under high-ambiguity and strong-noise conditions, outperforming existing methods in challenging PRVR scenarios.
📝 Abstract
Partially Relevant Video Retrieval (PRVR) aims to retrieve untrimmed videos partially relevant to a given query. The core challenge lies in learning robust query-video alignment against spurious semantic correlations arising from inherent data uncertainty: 1) query ambiguity, where the query incompletely characterizes the target video and often contains uninformative tokens, and 2) partial video relevance, where abundant query-irrelevant segments introduce contextual noise into cross-modal alignment. Existing methods often focus on enhancing multi-scale clip representations and retrieving the most relevant clip. However, the inherent data uncertainty in PRVR renders them vulnerable to distractor videos with spurious similarities, leading to suboptimal performance. To fill this research gap, we propose the Robust Alignment Learning (RAL) framework, which explicitly models the uncertainty in the data. Key innovations include: 1) we pioneer probabilistic modeling for PRVR by encoding videos and queries as multivariate Gaussian distributions. This not only quantifies data uncertainty but also enables proxy-level matching to capture the variability in cross-modal correspondences; 2) we account for the heterogeneous informativeness of query words and introduce learnable confidence gates to dynamically weight token-level similarities. As a plug-and-play solution, RAL can be seamlessly integrated into existing architectures. Extensive experiments across diverse retrieval backbones demonstrate its effectiveness.
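The two mechanisms described above, Gaussian encoding with proxy-level matching and confidence-gated token similarity, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the diagonal-Gaussian parameterization, the reparameterization-style proxy sampling, and the sigmoid gating with sum-normalization are all assumptions made for the sketch, and every function and variable name here is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_embed(x, W_mu, W_logvar, n_proxies=5):
    """Encode a feature vector as a diagonal Gaussian (mean + log-variance)
    and draw proxy samples via the reparameterization trick.
    The linear heads W_mu / W_logvar stand in for learned projection layers."""
    mu = x @ W_mu
    logvar = x @ W_logvar
    std = np.exp(0.5 * logvar)
    # Each proxy is one plausible embedding under the modeled uncertainty;
    # matching against several proxies captures variability in correspondence.
    eps = rng.standard_normal((n_proxies, mu.shape[-1]))
    proxies = mu + eps * std
    return mu, logvar, proxies

def gated_similarity(token_feats, token_sims, w_gate):
    """Aggregate token-level similarities with learnable confidence gates:
    uninformative tokens get low sigmoid scores and contribute less."""
    logits = token_feats @ w_gate              # (T,) gate logits per token
    conf = 1.0 / (1.0 + np.exp(-logits))       # sigmoid confidence in (0, 1)
    conf = conf / (conf.sum() + 1e-8)          # normalize over tokens
    return float((conf * token_sims).sum())

# Toy example with random features (dimension 8, 4 query tokens).
d = 8
x = rng.standard_normal(d)
W_mu = rng.standard_normal((d, d))
W_logvar = rng.standard_normal((d, d)) * 0.1
mu, logvar, proxies = gaussian_embed(x, W_mu, W_logvar)
print(proxies.shape)  # (5, 8): five sampled proxy embeddings

token_feats = rng.standard_normal((4, d))
token_sims = rng.standard_normal(4)
score = gated_similarity(token_feats, token_sims, rng.standard_normal(d))
print(score)
```

In a full retrieval pipeline, one would score a query against each video by comparing proxy sets (e.g., averaging pairwise proxy similarities) and train the gates end-to-end with the retrieval loss; those choices are left out of this sketch.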