Towards Efficient Partially Relevant Video Retrieval with Active Moment Discovering

📅 2025-04-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Partially Relevant Video Retrieval (PRVR) aims to model fine-grained alignment between text queries and local segments in untrimmed videos, yet existing methods suffer from overly independent segment representations and severe background redundancy. To address these limitations, the authors propose AMDNet: (1) a learnable span anchor mechanism that adaptively localizes candidate temporal intervals; (2) masked multi-moment attention to capture cross-segment semantic dependencies; and (3) joint optimization of a moment diversity loss, a moment relevance loss, and an end-to-end PRVR loss. On TVR, AMDNet achieves a +6.0 SumR gain over prior work while using only 1/15.5 the parameters of GMMFormer. It also establishes new state-of-the-art performance on ActivityNet Captions. The method is computationally efficient, generalizes well across datasets, and provides interpretable temporal localization via its learned anchors.

📝 Abstract
Partially relevant video retrieval (PRVR) is a practical yet challenging task in text-to-video retrieval, where videos are untrimmed and contain much background content. The goal is an effective and efficient solution for capturing the partial correspondence between text queries and untrimmed videos. However, existing PRVR methods, which typically focus on modeling multi-scale clip representations, suffer from content independence and information redundancy, impairing retrieval performance. To overcome these limitations, we propose a simple yet effective approach with active moment discovering (AMDNet). We are committed to discovering video moments that are semantically consistent with their queries. By using learnable span anchors to capture distinct moments and applying masked multi-moment attention to emphasize salient moments while suppressing redundant backgrounds, we achieve more compact and informative video representations. To further enhance moment modeling, we introduce a moment diversity loss to encourage different moments in distinct regions and a moment relevance loss to promote semantically query-relevant moments, which cooperate with a partially relevant retrieval loss for end-to-end optimization. Extensive experiments on two large-scale video datasets (i.e., TVR and ActivityNet Captions) demonstrate the superiority and efficiency of our AMDNet. In particular, AMDNet is about 15.5 times smaller (#parameters) while 6.0 points higher (SumR) than the recent method GMMFormer on TVR.
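The abstract's core idea, span anchors plus masked attention over frames, can be sketched in numpy. This is an illustrative reconstruction, not the paper's implementation: the Gaussian parametrization of anchors, the function names, and the pooling form are all assumptions made for clarity.

```python
import numpy as np

def moment_masks(centers, widths, num_frames):
    """Build soft masks from span anchors (hypothetical Gaussian form).

    Each anchor is a (center, width) pair in [0, 1] over normalized time;
    the paper's exact parametrization may differ.
    """
    t = np.linspace(0.0, 1.0, num_frames)          # normalized frame positions
    centers = np.asarray(centers, float)[:, None]  # (M, 1)
    widths = np.asarray(widths, float)[:, None]    # (M, 1)
    # (M, T): each row weights frames near its anchor, suppressing background
    return np.exp(-0.5 * ((t[None, :] - centers) / widths) ** 2)

def masked_moment_pooling(frame_feats, masks):
    """Attention-style pooling: each moment aggregates only the frames
    inside its soft span, yielding compact moment representations."""
    weights = masks / (masks.sum(axis=1, keepdims=True) + 1e-8)  # (M, T)
    return weights @ frame_feats                                  # (M, D)

# toy example: 8 frames with 4-dim features, 2 moment anchors
feats = np.random.default_rng(0).standard_normal((8, 4))
masks = moment_masks(centers=[0.25, 0.75], widths=[0.1, 0.1], num_frames=8)
moments = masked_moment_pooling(feats, masks)
print(moments.shape)  # (2, 4): one pooled feature per discovered moment
```

In a full model the anchor parameters would be learned end-to-end; here they are fixed to show the masking mechanics only.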
Problem

Research questions and friction points this paper is trying to address.

Efficient retrieval of partially relevant untrimmed videos
Overcoming content independence and redundancy in PRVR
Enhancing video representation with active moment discovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

Active moment discovering for semantic consistency
Masked multi-moment attention for salient moments
Diversity and relevance losses for optimization
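The two auxiliary losses named above can be illustrated with simple surrogates. These are assumptions for exposition, not the paper's definitions: diversity is approximated by pairwise mask overlap, and relevance by the best query-moment cosine similarity.

```python
import numpy as np

def diversity_loss(masks):
    """Penalize overlapping moment masks so anchors cover distinct regions.
    Illustrative surrogate: mean pairwise cosine similarity between masks."""
    norm = masks / (np.linalg.norm(masks, axis=1, keepdims=True) + 1e-8)
    sim = norm @ norm.T                        # (M, M) mask similarities
    M = masks.shape[0]
    return sim[~np.eye(M, dtype=bool)].mean()  # average off-diagonal overlap

def relevance_loss(moment_feats, query_feat):
    """Encourage at least one moment to match the query: negated best
    query-moment cosine similarity (an illustrative stand-in)."""
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    m = moment_feats / (np.linalg.norm(moment_feats, axis=1, keepdims=True) + 1e-8)
    return -np.max(m @ q)

# disjoint masks incur low diversity loss; identical masks incur high loss
print(diversity_loss(np.array([[1.0, 0.0], [0.0, 1.0]])))  # ~0.0
print(diversity_loss(np.array([[1.0, 0.0], [1.0, 0.0]])))  # ~1.0
```

In training these terms would be summed with the partially relevant retrieval loss; weighting coefficients are a design choice the snippet does not model.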
Peipei Song
University of Science and Technology of China
Multimedia · Computer Vision · Machine Learning
Long Zhang
School of Information Science and Technology, USTC, Hefei 230026, China
Long Lan
Institute for Quantum Information, and the State Key Laboratory of High Performance Computing, National University of Defense Technology (NUDT), Changsha, 410073, China
Weidong Chen
School of Information Science and Technology, USTC, Hefei 230026, China
Dan Guo
IEEE senior member, Professor, Hefei University of Technology
Multimedia Computing · Artificial Intelligence
Xun Yang
School of Information Science and Technology, USTC, Hefei 230026, China; MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, USTC
Meng Wang
Key Laboratory of Knowledge Engineering with Big Data (HFUT), Ministry of Education and School of Computer Science and Information Engineering, Hefei University of Technology (HFUT), Hefei, 230601, China, and Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, 230026, China