🤖 AI Summary
To address insufficient multi-scale contextual modeling and the difficulty of capturing high-order joint correlations in skeleton-based action recognition, this paper proposes an autoregressive vector-quantized hypergraph learning framework. Methodologically, it introduces a novel autoregressive vector-quantized hypergraph generation mechanism, integrated with model-agnostic adaptive hyperedge construction, to enable joint spatio-temporal-channel feature learning. A Transformer-based architecture enhances long-range dependency modeling, and a hybrid supervised-unsupervised training paradigm jointly optimizes action-specific representations across the spatial, temporal, and channel dimensions. The framework achieves state-of-the-art performance on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA benchmarks, significantly outperforming existing hypergraph-based approaches. Ablation studies confirm the effectiveness and complementarity of each component.
📝 Abstract
Graph Convolutional Networks (GCNs) alone are inadequate for extracting the multiscale contextual information and higher-order correlations in skeleton sequences that effective action classification requires. Hypergraph convolution addresses these issues but cannot capture long-range dependencies. The transformer is effective at modeling such dependencies and exposing complex contextual features. We propose an Autoregressive Adaptive HyperGraph Transformer (AutoregAd-HGformer) model for in-phase (autoregressive and discrete) and out-phase (adaptive) hypergraph generation. The vector-quantized in-phase hypergraph, equipped with powerful autoregressive learned priors, produces a more robust and informative representation for hyperedge formation. The out-phase hypergraph generator provides a model-agnostic hyperedge learning technique that aligns hyperedge attributes with the input skeleton embedding. Hybrid (supervised and unsupervised) learning in AutoregAd-HGformer explores action-dependent features along the spatial, temporal, and channel dimensions. Extensive experimental results and an ablation study show the superiority of our model over state-of-the-art hypergraph architectures on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets.
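To make the "vector-quantized hypergraph" idea concrete, the sketch below illustrates the generic operation behind it: snapping continuous joint embeddings to a learned codebook and grouping joints that share a code into hyperedges. This is a minimal illustration of the general vector-quantization principle, not the paper's actual implementation; the function names, shapes, and the codes-to-hyperedge grouping are hypothetical assumptions.

```python
import numpy as np

def vector_quantize(embeddings: np.ndarray, codebook: np.ndarray):
    """Assign each joint embedding to its nearest codebook vector.

    embeddings: (num_joints, dim) continuous joint features
    codebook:   (num_codes, dim) learned prototype vectors
    Returns (codes, quantized): a discrete code per joint and the
    corresponding quantized features. (Illustrative only.)
    """
    # Pairwise squared Euclidean distances, shape (num_joints, num_codes)
    dists = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = dists.argmin(axis=1)   # discrete token per joint
    quantized = codebook[codes]    # snap each feature onto the codebook
    return codes, quantized

def codes_to_incidence(codes: np.ndarray, num_codes: int) -> np.ndarray:
    """Hypothetical grouping step: joints sharing a code form one hyperedge,
    giving a (num_joints x num_hyperedges) incidence matrix H."""
    H = np.zeros((codes.shape[0], num_codes))
    H[np.arange(codes.shape[0]), codes] = 1.0
    return H
```

In a VQ-VAE-style pipeline the codebook would be learned jointly with the encoder, and an autoregressive prior over the discrete codes (as the abstract describes) would regularize which code sequences are likely; the snippet shows only the quantization and hyperedge-grouping skeleton.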