🤖 AI Summary
This work addresses the limitations of existing CNN and Transformer models in temporal action detection on long, untrimmed videos, which stem from feature redundancy and degraded global temporal modeling. To overcome these challenges, the authors introduce an Efficient Spatial-Temporal Focal (ESTF) Adapter into the pre-trained layers. The adapter combines a Temporal Boundary-aware State Space Model (TB-SSM) for temporal feature modeling with efficient processing of spatial features, building on the linear-complexity long-term modeling that SSMs offer. Experiments across multiple benchmarks, comparing against previous SSM-based methods and methods built on other architectures, show significant improvements in both localization performance and robustness.
📝 Abstract
Temporal human action detection aims to identify and localize action segments within untrimmed videos and is a pivotal task in video understanding. Despite the progress achieved by prior architectures such as CNNs and Transformers, these models still struggle with feature redundancy and degraded global dependency modeling when applied to long video sequences. These limitations severely constrain their scalability in real-world video analysis. State Space Models (SSMs) offer a promising alternative, combining linear-complexity long-term modeling with robust global temporal reasoning. Rethinking the application of SSMs to temporal modeling, this research constructs a novel framework for video human action detection. Specifically, we introduce the Efficient Spatial-Temporal Focal (ESTF) Adapter into the pre-trained layers. This module combines our proposed Temporal Boundary-aware SSM (TB-SSM) for temporal feature modeling with efficient processing of spatial features. We perform comprehensive quantitative analyses across multiple benchmarks, comparing our method against previous SSM-based methods and methods built on other architectures. Extensive experiments demonstrate that our strategy significantly improves both localization performance and robustness, validating the effectiveness of the proposed method.
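To make the "linear long-term modeling" property of SSMs concrete, the following is a minimal sketch of a generic diagonal, discrete-time state space recurrence. It is not the TB-SSM or ESTF Adapter described above, whose details the abstract does not give. The function name `ssm_scan` and the parameters `a`, `b`, `c` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def ssm_scan(
    x: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> List[float]:
    """Run a diagonal discrete-time linear SSM over a 1-D input sequence.

    Illustrative recurrence (assumed generic form, not the paper's model):
        h_t = a * h_{t-1} + b * x_t    (elementwise over the state dimension)
        y_t = sum(c * h_t)
    """
    state = [0.0] * len(a)  # hidden state h_0, one entry per state dimension
    outputs: List[float] = []
    for x_t in x:  # single pass: cost grows linearly with the sequence length
        state = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c, state)))
    return outputs


if __name__ == "__main__":
    # An impulse at t=0 decays geometrically through the state (a = 0.5).
    print(ssm_scan([1.0, 0.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0]))
    # [1.0, 0.5, 0.25, 0.125]
```

Each time step updates a fixed-size state, so the whole sequence is processed in one pass, whereas self-attention compares every pair of time steps. The decaying impulse response shows how earlier inputs keep influencing later outputs through the state.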