Hierarchical Multi-Stage Transformer Architecture for Context-Aware Temporal Action Localization

📅 2025-07-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
For context-aware temporal action localization (TAL) in untrimmed videos, this paper proposes PCL-Former, a hierarchical, multi-stage Transformer architecture. It decomposes the task into three specialized modules: Proposal-Former generates high-quality candidate proposals, Classification-Former performs fine-grained action classification, and Localization-Former refines the temporal boundaries. Each module is trained with a dedicated loss function and models spatio-temporal contextual dependencies. Its key innovation is the first integration of a task-driven, multi-stage cascaded paradigm with Transformer-based modeling, enabling end-to-end trainable, context-aware localization. Experiments show state-of-the-art performance on THUMOS-14 (+2.8% mAP), ActivityNet-1.3 (+1.2% mAP), and HACS (+4.8% mAP), indicating that the accuracy gains generalize across benchmarks.

📝 Abstract
Inspired by the recent success of transformers and multi-stage architectures in video recognition and object detection, we thoroughly explore the rich spatio-temporal properties of transformers within a multi-stage architecture paradigm for the temporal action localization (TAL) task. This exploration led to the development of a hierarchical multi-stage transformer architecture called PCL-Former, where each subtask is handled by a dedicated transformer module with a specialized loss function. Specifically, the Proposal-Former identifies candidate segments in an untrimmed video that may contain actions, the Classification-Former classifies the action categories within those segments, and the Localization-Former precisely predicts the temporal boundaries (i.e., start and end) of the action instances. To evaluate our method, we conducted extensive experiments on three challenging benchmark datasets: THUMOS-14, ActivityNet-1.3, and HACS Segments. We also conducted detailed ablation experiments to assess the impact of each individual module of PCL-Former. The quantitative results validate the effectiveness of the proposed PCL-Former, which outperforms state-of-the-art TAL approaches by 2.8%, 1.2%, and 4.8% on the THUMOS-14, ActivityNet-1.3, and HACS Segments datasets, respectively.
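
The division of labor described in the abstract can be pictured as a simple cascade. The sketch below is illustrative only and is not the authors' implementation: the names `Detection`, `localize_actions`, and `min_score`, the score threshold, and the toy stand-in functions are all assumptions. Each stage is passed in as a plain callable where the paper uses a transformer module.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# (start, end, confidence) of a candidate segment, in seconds. Hypothetical layout.
Segment = Tuple[float, float, float]


@dataclass
class Detection:
    label: str    # action category assigned by the classification stage
    start: float  # refined start time, in seconds
    end: float    # refined end time, in seconds
    score: float  # proposal confidence carried through the cascade


def localize_actions(
    features: Sequence[float],
    propose: Callable[[Sequence[float]], List[Segment]],
    classify: Callable[[Sequence[float], Segment], str],
    refine: Callable[[Sequence[float], Segment], Tuple[float, float]],
    min_score: float = 0.5,  # assumed cut-off; the source does not specify one
) -> List[Detection]:
    """Run the three stages in order: propose, classify, refine boundaries."""
    detections = []
    for segment in propose(features):
        if segment[2] < min_score:
            continue  # skip low-confidence candidates
        label = classify(features, segment)
        start, end = refine(features, segment)
        detections.append(Detection(label, start, end, segment[2]))
    return detections


# Toy stand-ins for the three modules, purely for illustration.
def toy_propose(features):
    return [(1.0, 4.0, 0.9), (6.0, 7.0, 0.2)]


def toy_classify(features, segment):
    return "jump"


def toy_refine(features, segment):
    start, end, _ = segment
    return start + 0.5, end - 0.5


result = localize_actions([0.0] * 10, toy_propose, toy_classify, toy_refine)
```

With the toy stand-ins, only the first candidate passes the threshold, so `result` holds a single `Detection` whose boundaries have been tightened by the refinement stage.
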
Problem

Research questions and friction points this paper is trying to address.

Develops transformer architecture for temporal action localization
Identifies and classifies action segments in videos
Improves accuracy of action boundary prediction
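
Boundary accuracy in TAL is commonly measured by the temporal intersection-over-union (tIoU) between a predicted segment and a ground-truth segment, and mAP is typically reported at one or more tIoU thresholds. The source does not spell out its evaluation protocol, so the following is a minimal sketch of the standard overlap measure; the function name `temporal_iou` is an assumption.

```python
def temporal_iou(pred, truth):
    """Temporal IoU of two (start, end) segments, in the same time unit."""
    pred_start, pred_end = pred
    true_start, true_end = truth
    # Length of the overlap; zero when the segments are disjoint.
    intersection = max(0.0, min(pred_end, true_end) - max(pred_start, true_start))
    union = (pred_end - pred_start) + (true_end - true_start) - intersection
    return intersection / union if union > 0 else 0.0


overlap = temporal_iou((2.0, 6.0), (4.0, 8.0))  # 2 s overlap over a 6 s union
```

A prediction usually counts as correct when its tIoU with a ground-truth instance of the same class meets the chosen threshold, so tighter boundaries directly raise mAP at the stricter thresholds.
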
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical multi-stage transformer for TAL
Dedicated transformer modules per subtask
Specialized loss functions for each module