TSGCNeXt: Dynamic-Static Multi-Graph Convolution for Efficient Skeleton-Based Action Recognition with Long-term Learning Potential

📅 2023-04-23

🏛️ arXiv.org

📈 Citations: 8

✨ Influential: 1


🤖 AI Summary

To address structural redundancy, inefficient dynamic graph learning, and weak long-term temporal modeling in GCN-based skeleton action recognition, this paper proposes Dynamic-Static Separate Multi-graph Convolution (DS-SMG), which aggregates features from multiple independent topological graphs so that node information is not ignored during dynamic convolution. It also introduces a training acceleration mechanism that optimizes the back-propagation computation of dynamic graph learning, with a reported 55.08% speed-up. The resulting ConvNeXt-style network, TSGCNeXt, restructures the GCN around three spatio-temporal learning modules to model long temporal features efficiently. On NTU RGB+D 60 and 120, TSGCNeXt outperforms previous methods among single-stream networks. With an Exponential Moving Average (EMA) model introduced into multi-stream fusion, it reaches 90.22% (cross-subject) and 91.74% (cross-setup) accuracy on NTU RGB+D 120.

📝 Abstract

Skeleton-based action recognition has achieved remarkable results in human action recognition with the development of graph convolutional networks (GCNs). However, recent works tend to construct complex learning mechanisms with redundant training, and they hit a bottleneck on long time series. To solve these problems, we propose the Temporal-Spatio Graph ConvNeXt (TSGCNeXt) to explore an efficient learning mechanism for long temporal skeleton sequences. First, a new graph learning mechanism with a simple structure, Dynamic-Static Separate Multi-graph Convolution (DS-SMG), is proposed to aggregate features of multiple independent topological graphs and to avoid node information being ignored during dynamic convolution. Next, we construct a graph convolution training acceleration mechanism that optimizes the back-propagation computation of dynamic graph learning, with a 55.08% speed-up. Finally, TSGCNeXt restructures the overall structure of the GCN with three spatio-temporal learning modules, efficiently modeling long temporal features. Compared with previous methods on the large-scale datasets NTU RGB+D 60 and 120, TSGCNeXt outperforms them among single-stream networks. In addition, with the EMA model introduced into the multi-stream fusion, TSGCNeXt achieves SOTA levels: on the cross-subject and cross-setup benchmarks of NTU RGB+D 120, accuracies reach 90.22% and 91.74%.
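
The abstract does not spell out how the EMA model enters the multi-stream fusion, so the following is only a minimal Python sketch of the two generic ingredients: an exponential moving average of model weights, and a weighted sum of per-class scores from several streams. The function names `ema_update` and `fuse_streams`, the `decay` value, and the stream weights are all assumptions made for illustration, not details taken from the paper.

```python
def ema_update(ema_weights, model_weights, decay=0.999):
    """Blend the current model weights into the running EMA copy, in place.

    Both arguments are dicts mapping a parameter name to a float.
    `decay` is a hypothetical value; the paper's setting is not given here.
    """
    for name, value in model_weights.items():
        ema_weights[name] = decay * ema_weights[name] + (1.0 - decay) * value
    return ema_weights


def fuse_streams(stream_scores, stream_weights):
    """Return the weighted sum of per-class scores from several streams."""
    num_classes = len(stream_scores[0])
    fused = [0.0] * num_classes
    for scores, weight in zip(stream_scores, stream_weights):
        for c in range(num_classes):
            fused[c] += weight * scores[c]
    return fused


# Usage: one EMA step on a single parameter, then fusion of two streams.
ema = {"w": 0.0}
ema_update(ema, {"w": 1.0}, decay=0.9)
fused = fuse_streams([[0.2, 0.8], [0.6, 0.4]], [1.0, 1.0])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

In this sketch the EMA copy would be updated after every optimizer step, and the scores passed to `fuse_streams` could come from the EMA copy of each stream (for example joint and bone streams) rather than from the raw training weights.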

Problem

Research questions and friction points this paper is trying to address.

Efficient learning for long skeleton action sequences

Overcoming complex redundant training mechanisms

Addressing long time-series modeling bottlenecks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic-Static Separate Multi-graph Convolution mechanism

Graph convolution training acceleration mechanism

Network structure built from three spatio-temporal learning modules
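
As a rough illustration of the first item, the idea of keeping static and dynamic graphs separate and summing their aggregated features can be sketched in plain Python. This is not the paper's DS-SMG formulation: the similarity-based `dynamic_graph`, the helper names, and the toy three-joint skeleton are assumptions, and the learnable per-graph weights a real layer would have are omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def softmax_rows(m):
    """Row-wise softmax, so each row of the adjacency sums to 1."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def dynamic_graph(x):
    """Hypothetical data-dependent adjacency: softmax of joint similarity."""
    xt = [list(col) for col in zip(*x)]
    return softmax_rows(matmul(x, xt))


def multi_graph_conv(x, static_graphs):
    """Sum the aggregations from each static graph and one dynamic graph."""
    graphs = list(static_graphs) + [dynamic_graph(x)]
    out = [[0.0] * len(x[0]) for _ in x]
    for adj in graphs:
        agg = matmul(adj, x)  # each graph aggregates node features on its own
        for i, row in enumerate(agg):
            for j, v in enumerate(row):
                out[i][j] += v
    return out


# Three joints in a chain (0-1-2), two feature channels per joint.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
chain = [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]
y = multi_graph_conv(x, [identity, chain])
```

Because every graph contributes its own aggregation before the sum, the identity graph here guarantees that each joint's own features survive even when the dynamic adjacency puts little weight on them.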
