Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions

📅 2025-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-video retrieval methods that rely on auxiliary captions from vision-language models often fail to capture fine-grained temporal semantics and are susceptible to incorrect generated information, which can mislead retrieval. To address these limitations, the authors propose NarVid, a framework that comprehensively exploits frame-level captions, treated as a narration of the video. Specifically, NarVid (1) enhances features through cross-modal interaction between the narration and the video; (2) applies query-aware adaptive filtering to suppress irrelevant or incorrect caption information; (3) computes a dual-modal matching score by adding query-video and query-narration similarities; and (4) trains with a hard-negative loss that uses the two similarity views to learn discriminative features. Experiments show that NarVid outperforms state-of-the-art methods across multiple standard text-video retrieval benchmarks.
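The dual-modal matching score described above adds query-video and query-narration similarities. Below is a minimal PyTorch sketch of how such a fused score could be computed at retrieval time; the function name and embedding shapes are illustrative assumptions, while the additive fusion itself follows the abstract.

```python
import torch
import torch.nn.functional as F

def dual_modal_score(query_emb, video_emb, narration_emb):
    """Fuse query-video and query-narration similarities into one matching score.

    query_emb:     (Nq, D) text-query embeddings
    video_emb:     (Nv, D) pooled video embeddings
    narration_emb: (Nv, D) pooled frame-caption (narration) embeddings
    Names and shapes are illustrative; the paper describes adding the two
    similarities, which is what this sketch does.
    """
    q = F.normalize(query_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    n = F.normalize(narration_emb, dim=-1)

    sim_qv = q @ v.t()   # (Nq, Nv) query-video cosine similarities
    sim_qn = q @ n.t()   # (Nq, Nv) query-narration cosine similarities
    return sim_qv + sim_qn  # higher score = better match; rank videos per query
```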

📝 Abstract
In recent text-video retrieval, the use of additional captions from vision-language models has shown promising effects on performance. However, existing models using additional captions have often struggled to capture the rich semantics, including temporal changes, inherent in the video. In addition, incorrect information produced by generative models can lead to inaccurate retrieval. To address these issues, we propose a new framework, Narrating the Video (NarVid), which strategically leverages the comprehensive information available from frame-level captions, i.e., the narration. The proposed NarVid exploits the narration in multiple ways: 1) feature enhancement through cross-modal interactions between narration and video, 2) query-aware adaptive filtering to suppress irrelevant or incorrect information, 3) a dual-modal matching score obtained by adding query-video similarity and query-narration similarity, and 4) a hard-negative loss to learn discriminative features from multiple perspectives using the two similarities from different views. Experimental results demonstrate that NarVid achieves state-of-the-art performance on various benchmark datasets.
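One way to picture the query-aware adaptive filtering in point 2 is to weight each frame-level caption by its relevance to the query before pooling, so that irrelevant or incorrect captions contribute little. The softmax gating, the temperature, and the function name below are assumptions for illustration, not the paper's exact mechanism.

```python
import torch
import torch.nn.functional as F

def query_aware_filter(query_emb, caption_embs, temperature=0.07):
    """Down-weight frame captions that are irrelevant to the query.

    query_emb:    (D,)    embedding of the text query
    caption_embs: (T, D)  embeddings of the T frame-level captions (narration)
    Returns a query-conditioned pooled narration embedding (D,).
    The gating-by-similarity scheme is an illustrative assumption.
    """
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(caption_embs, dim=-1)

    relevance = c @ q                                         # (T,) cosine relevance per caption
    weights = torch.softmax(relevance / temperature, dim=0)   # suppress low-relevance captions
    return (weights.unsqueeze(-1) * caption_embs).sum(dim=0)  # filtered narration feature
```

A filtered narration feature like this could then feed a query-narration similarity such as the dual-modal score sketched earlier.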
Problem

Research questions and friction points this paper is trying to address.

Enhance text-video retrieval using frame-level captions.
Address incorrect information from generative models.
Improve semantic capture of temporal video changes.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal interaction between narration and video enhances frame features (see the sketch after this list).
Query-aware filtering removes incorrect information.
Dual-modal matching improves retrieval accuracy.
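As a rough illustration of the first point, per-frame visual features could be enhanced by attending to the co-occurring narration features. The module name, the single cross-attention layer, and the residual design below are assumptions; the paper only states that narration and video interact cross-modally.

```python
import torch
import torch.nn as nn

class NarrationVideoEnhancer(nn.Module):
    """Enhance frame features with frame-level caption features via cross-attention.

    Hypothetical single-layer sketch, not the paper's exact architecture.
    """

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats, narration_feats):
        # frame_feats:     (B, T, D) per-frame visual features
        # narration_feats: (B, T, D) per-frame caption (narration) features
        attended, _ = self.cross_attn(query=frame_feats,
                                      key=narration_feats,
                                      value=narration_feats)
        # Residual connection keeps the original visual content.
        return self.norm(frame_feats + attended)
```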