SCANet: Scene Complexity Aware Network for Weakly-Supervised Video Moment Retrieval

📅 2023-10-01

🏛️ IEEE International Conference on Computer Vision

📈 Citations: 9

✨ Influential: 2

🤖 AI Summary

Weakly supervised video moment retrieval typically generates a fixed number of proposals, so the proposal count cannot adapt to how scene complexity varies across videos. To address this limitation, the paper introduces a scene complexity modeling framework. It formulates a learnable, video-level scene complexity metric and designs a dedicated complexity prediction network that guides both multi-scale temporal modeling and adaptive proposal generation. Fine-grained text-video alignment is achieved via contrastive cross-modal matching. Evaluated on three benchmarks (Charades-STA, ActivityNet Captions, and TVR), the approach achieves state-of-the-art performance, with substantial improvements in mean Average Precision (mAP). These results support the claim that explicit scene complexity awareness improves proposal quality and model generalization in weakly supervised settings.

📝 Abstract

Video moment retrieval (VMR) aims to localize the moments in a video that correspond to a given language query. To avoid the high cost of annotating temporal moments, weakly-supervised VMR (wsVMR) systems have been studied. A popular approach in such systems is to generate a number of proposals as moment candidates and then select the most appropriate one. These proposals are assumed to cover the many distinguishable scenes in a video. However, the proposals in existing wsVMR systems do not respect the varying number of scenes across videos: they are determined heuristically, irrespective of the video. We argue that the retrieval system should be able to counter the complexities caused by the varying number of scenes in each video. To this end, we present a novel retrieval system, the Scene Complexity Aware Network (SCANet), which measures the 'scene complexity' of the multiple scenes in each video and generates adaptive proposals that respond to the variable scene complexity of each video. Experiments on three retrieval benchmarks (i.e., Charades-STA, ActivityNet, TVR) show state-of-the-art performance and demonstrate the effectiveness of incorporating scene complexity.
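The idea of letting the number of proposals follow the number of scenes can be illustrated with a toy sketch. This is not SCANet's learned complexity measure or its proposal generator, neither of which is specified in the abstract. As assumptions made purely for illustration, the hypothetical `scene_complexity` function counts abrupt changes between adjacent frame features using a hand-picked cosine-similarity threshold, and the hypothetical `adaptive_proposals` function splits the normalized video timeline into that many equal segments and emits every contiguous run of segments as a candidate moment.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def scene_complexity(frames, threshold=0.5):
    """Hypothetical proxy for scene complexity (an assumption, not the paper's metric).

    Returns 1 plus the number of adjacent frame pairs whose cosine
    similarity falls below `threshold`, i.e. a rough scene count.
    """
    cuts = sum(
        1 for prev, cur in zip(frames, frames[1:]) if cosine(prev, cur) < threshold
    )
    return cuts + 1


def adaptive_proposals(complexity):
    """Emit (start, end) proposals in normalized time [0, 1].

    The timeline is split into `complexity` equal segments and every
    contiguous run of segments becomes a proposal, so the proposal count
    is complexity * (complexity + 1) / 2 and grows with the scene count.
    """
    edges = [i / complexity for i in range(complexity + 1)]
    return [
        (edges[i], edges[j])
        for i in range(complexity)
        for j in range(i + 1, complexity + 1)
    ]


# Toy 2-D frame features with one abrupt change between frames 2 and 3.
frames = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1]]
k = scene_complexity(frames)
proposals = adaptive_proposals(k)
```

In this sketch a static video yields a single proposal spanning the whole timeline, while a video with more detected scene changes yields more, and finer, candidates. A fixed-count scheme would instead produce the same candidates for both.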

Problem

Research questions and friction points this paper is trying to address.

Weakly-supervised video moment retrieval without temporal annotations

Adaptive proposals for varying scene complexities in videos

Improving retrieval accuracy by measuring scene complexity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive proposals based on scene complexity

Measures varying scene complexities in videos

State-of-the-art performance on benchmarks
