Static and Dynamic Graph Alignment Network for Temporal Video Grounding

📅 2026-05-01
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing graph convolutional network–based approaches for temporal video grounding suffer from limitations in visual representation completeness, query-aware graph construction, and semantic matching granularity, which hinder localization accuracy. To address these issues, this work proposes a complementary temporal graph construction mechanism that fuses static and dynamic visual features, along with a query-guided adaptive graph modeling framework and a query-clip contrastive learning strategy to enhance query-aware representations. Furthermore, the method incorporates multi-granularity temporal proposals and a progressive easy-to-hard training scheme to refine semantic alignment and facilitate model convergence. Evaluated on three benchmark datasets, the proposed approach significantly outperforms current state-of-the-art methods and achieves leading performance in complex temporal grounding tasks.
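As a rough illustration of what multi-granularity temporal proposals can look like, the sketch below enumerates sliding windows of several lengths over clip indices and scores them against a target moment with temporal IoU. This is not the authors' implementation: the function names, the window sizes, and the half-window stride are all assumptions made for the example.

```python
def temporal_iou(a, b):
    """Temporal IoU of two (start, end) segments measured in clip indices."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def multi_granularity_proposals(num_clips, window_sizes=(2, 4, 8)):
    """Enumerate candidate segments at several granularities.

    Hypothetical scheme: one sliding window per size, with a stride of
    half the window length (at least 1 clip).
    """
    proposals = []
    for size in window_sizes:
        stride = max(1, size // 2)
        for start in range(0, num_clips - size + 1, stride):
            proposals.append((start, start + size))
    return proposals


def best_proposal(proposals, target):
    """Return the proposal with the highest temporal IoU against the target."""
    return max(proposals, key=lambda p: temporal_iou(p, target))


if __name__ == "__main__":
    candidates = multi_granularity_proposals(num_clips=8)
    print(len(candidates), "proposals")
    print("best match for (2, 6):", best_proposal(candidates, (2, 6)))
```

Short windows give fine boundary candidates and long windows give coarse ones; a model would score each candidate against the query rather than against a known target.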
📝 Abstract
Temporal Video Grounding (TVG) aims to localize temporal moments in an untrimmed video that semantically correspond to given natural language queries. Recently, Graph Convolutional Networks (GCNs) have been widely adopted in TVG to model temporal relations among video clips and enhance contextual reasoning by constructing clip-level graphs. Despite their effectiveness, existing GCN-based TVG methods encounter three critical bottlenecks: 1) most methods construct graph nodes using either static or dynamic features alone, resulting in incomplete visual representations that overlook complementary semantics; 2) most methods construct temporal graphs in a query-agnostic manner, leading to inefficient feature interaction within the temporal graph representation; and 3) most methods rely on single-granularity semantic matching, and training directly on the complex temporal localization task may lead to slow convergence and suboptimal precision. To address these challenges, we propose the Static and Dynamic Graph Alignment Network (SDGAN). First, SDGAN jointly exploits static and dynamic visual features to construct two complementary temporal graphs and performs Position-wise Nodes Alignment, enabling more expressive and robust visual representations. Second, SDGAN introduces Query-Clip Contrastive Learning and Adaptive Graph Modeling to explicitly align visual clips with their corresponding textual queries, yielding query-aware visual representations. Third, SDGAN incorporates multi-granularity temporal proposals within a Progressive Easy-to-Hard Training Strategy, effectively bridging coarse-grained semantic localization and fine-grained temporal boundary refinement. Extensive experiments on three benchmark datasets demonstrate that SDGAN achieves superior performance across complex TVG scenarios. Code and datasets are available at https://github.com/ZhanJieHu/SDGAN.
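The abstract does not give the form of the Query-Clip Contrastive Learning objective. A common way to align a query with its matching clips is an InfoNCE-style loss, sketched below under that assumption. The function names, the cosine similarity, the temperature value, and the choice of treating clips inside the annotated moment as positives are illustrative and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def query_clip_info_nce(query, clips, positive_mask, temperature=0.1):
    """InfoNCE-style loss for one query against all clips of a video.

    Assumption: clips flagged in positive_mask (for example, those inside
    the annotated moment) are positives; all other clips are negatives.
    """
    logits = [cosine(query, clip) / temperature for clip in clips]
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(l - max_logit) for l in logits]
    denom = sum(exps)
    positives = [e for e, is_pos in zip(exps, positive_mask) if is_pos]
    if not positives:
        raise ValueError("at least one positive clip is required")
    # Average negative log-likelihood over the positive clips.
    return -sum(math.log(e / denom) for e in positives) / len(positives)


if __name__ == "__main__":
    query = [1.0, 0.0]
    clips = [[1.0, 0.0], [0.0, 1.0]]
    print("aligned:   ", query_clip_info_nce(query, clips, [True, False]))
    print("misaligned:", query_clip_info_nce(query, clips, [False, True]))
```

The loss is small when the positive clip is the one most similar to the query and large otherwise, which is the behaviour a query-aware clip representation needs.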
Problem

Research questions and friction points this paper is trying to address.

Temporal Video Grounding
Graph Convolutional Networks
Static and Dynamic Features
Query-Aware Representation
Multi-Granularity Matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Static and Dynamic Graph Alignment
Query-Aware Visual Representation
Multi-Granularity Temporal Proposals
Progressive Easy-to-Hard Training
Temporal Video Grounding
🔎 Similar Papers
2024-03-21
IEEE Transactions on Pattern Analysis and Machine Intelligence
Citations: 2
Zhanjie Hu
Faculty of Electronic Engineering and Computer Science, Ningbo University, Ningbo, China
Bolin Zhang
Faculty of Electronic Engineering and Computer Science, Ningbo University, Ningbo, China
Jianhua Wang
College of Computer Science, Inner Mongolia University, Hohhot, China
Jianbo Zheng
College of Computing and Engineering, Hunan Normal University, Changsha, China
Chenchen Yan
Faculty of Computing, Georg-August-Universität Göttingen, Germany
Takahiro Komamizu
Nagoya University
LOD, Semantic Web, Imbalanced Classification, Application of NLP
Ichiro Ide
Professor of Informatics, Nagoya University
Image & Video Analysis, Natural Language Processing, Information Retrieval, Multimedia Contents Analysis & Authoring
Jiangbo Qian
Merchants’ Guild Economics and Cultural Intelligent Computing Laboratory, Ningbo University, Ningbo 315211, China