🤖 AI Summary
Natural Language Video Localization (NLVL) under point supervision faces a trade-off between low annotation cost and fine-grained temporal alignment. To address this, the authors propose Collaborative Temporal Consistency Learning (COTEL), a framework that jointly models frame-level and segment-level temporal consistency. COTEL introduces cross-consistency guidance mechanisms, Frame-level Consistency Guidance (FCG) and Segment-level Consistency Guidance (SCG), together with a Hierarchical Contrastive Alignment Loss (HCAL), enabling fine-grained language-video temporal alignment from only a single annotated frame per moment. Compared with conventional boundary-supervised paradigms, COTEL substantially reduces annotation overhead. Extensive experiments show that COTEL performs favorably against state-of-the-art approaches on both the Charades-STA and ActivityNet Captions benchmarks, validating its effectiveness and generality under low-cost annotation constraints.
📝 Abstract
Natural language video localization (NLVL) is a crucial task in video understanding that aims to localize the target moment in a video specified by a given language description. Recently, a point-supervised paradigm has been proposed for this task, requiring only a single annotated frame within the target moment rather than complete temporal boundaries. Compared with the fully-supervised paradigm, it offers a balance between localization accuracy and annotation cost. However, due to the absence of complete annotations, it is challenging to align the video content with language descriptions, which hinders accurate moment prediction. To address this problem, we propose a new COllaborative Temporal consistEncy Learning (COTEL) framework that leverages the synergy between saliency detection and moment localization to strengthen video-language alignment. Specifically, we first design frame- and segment-level Temporal Consistency Learning (TCL) modules that model semantic alignment across frame saliencies and sentence-moment pairs. Then, we design a cross-consistency guidance scheme, comprising a Frame-level Consistency Guidance (FCG) and a Segment-level Consistency Guidance (SCG), that enables the two temporal consistency learning paths to mutually reinforce each other. Further, we introduce a Hierarchical Contrastive Alignment Loss (HCAL) to comprehensively align the video and the text query. Extensive experiments on two benchmarks demonstrate that our method performs favorably against state-of-the-art approaches. We will release all source code.
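The abstract describes aligning paired sentence and moment embeddings with a contrastive loss (HCAL). As a rough illustration of the underlying idea, the sketch below implements a standard symmetric InfoNCE-style contrastive loss over a batch of paired video-moment and sentence embeddings; the actual HCAL is hierarchical and its exact formulation is given in the paper, so the function name, temperature value, and flat (single-level) structure here are assumptions, not the authors' implementation.

```python
import numpy as np

def contrastive_alignment_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: row i of video_emb is the positive
    match for row i of text_emb; all other rows in the batch are negatives.
    Illustrative stand-in only, not COTEL's actual HCAL."""
    # L2-normalize so dot products are cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature          # (B, B) similarity matrix
    labels = np.arange(logits.shape[0])       # diagonal entries are positives

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # average the video-to-text and text-to-video directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Pulling matched pairs together while pushing apart mismatched sentence-moment pairs in the same batch is the standard mechanism such alignment losses rely on; perfectly aligned embeddings drive the loss toward zero.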