MambaTAD: When State-Space Models Meet Long-Range Temporal Action Detection

📅 2025-11-22

📈 Citations: 0

✨ Influential: 0

career value

285K/year

🤖 AI Summary

Long-duration action detection (TAD) in untrimmed videos faces critical challenges including temporal context decay, intra-frame self-interference, and insufficient global perception. To address these, this paper proposes an end-to-end, one-stage TAD framework. Our method introduces three key innovations: (1) a Diagonal-Masked Bidirectional State Space (DMBSS) module to mitigate temporal decay and cross-frame self-interference in recurrent modeling; (2) a State Space Temporal Adapter (SSTA) to enhance long-range temporal dependency modeling; and (3) a multi-granularity global feature fusion detection head to improve global contextual awareness and boundary localization accuracy. Extensive experiments demonstrate that our approach achieves new state-of-the-art performance on standard benchmarks—including THUMOS14 and ActivityNet v1.3—while simultaneously reducing computational cost and model parameters. This work establishes a more efficient and effective paradigm for long-duration action detection in untrimmed videos.

Technology Category

Application Category

📝 Abstract

Temporal Action Detection (TAD) aims to identify and localize actions by determining their starting and ending frames within untrimmed videos. Recent Structured State-Space Models such as Mamba have demonstrated potential in TAD due to their long-range modeling capability and linear computational complexity. On the other hand, structured state-space models often face two key challenges in TAD, namely, decay of temporal context due to recursive processing and self-element conflict during global visual context modeling, which become more severe while handling long-span action instances. Additionally, traditional methods for TAD struggle with detecting long-span action instances due to a lack of global awareness and inefficient detection heads. This paper presents MambaTAD, a new state-space TAD model that introduces long-range modeling and global feature detection capabilities for accurate temporal action detection. MambaTAD comprises two novel designs that complement each other with superior TAD performance. First, it introduces a Diagonal-Masked Bidirectional State-Space (DMBSS) module which effectively facilitates global feature fusion and temporal action detection. Second, it introduces a global feature fusion head that refines the detection progressively with multi-granularity features and global awareness. In addition, MambaTAD tackles TAD in an end-to-end one-stage manner using a new state-space temporal adapter(SSTA) which reduces network parameters and computation cost with linear complexity. Extensive experiments show that MambaTAD achieves superior TAD performance consistently across multiple public benchmarks.

Problem

Research questions and friction points this paper is trying to address.

Detecting long-span actions in videos with global awareness

Overcoming temporal context decay in state-space models

Improving efficiency of temporal action detection heads

Innovation

Methods, ideas, or system contributions that make the work stand out.

Diagonal-Masked Bidirectional State-Space module for global fusion

Global feature fusion head with multi-granularity refinement

End-to-end state-space temporal adapter with linear complexity

🔎 Similar Papers

DyGMamba: Efficiently Modeling Long-Term Temporal Dependency on Continuous-Time Dynamic Graphs with State Space Models