🤖 AI Summary
Point cloud videos mitigate interference from illumination and viewpoint variations but suffer from spatiotemporal disorder, which hinders unidirectional sequence modeling. To address this, we propose the Unified Spatio-Temporal State Space Model (UST-SSM), which enables ordered modeling of disordered point cloud sequences via three key components: (1) Spatial-Temporal Selection Scanning, a semantic-aware selective scanning strategy; (2) Spatio-Temporal Structure Aggregation, a structured feature aggregation mechanism; and (3) Temporal Interaction Sampling, which exploits non-anchor frames for temporal interaction. UST-SSM efficiently captures long-range geometric similarities and fine-grained motion dependencies. Furthermore, it integrates prompt-guided clustering with dynamic 4D spatiotemporal feature aggregation, strengthening the joint representation of geometric structure and motion patterns. Evaluated on the MSR-Action3D, NTU RGB+D, and Synthia 4D benchmarks, our method achieves state-of-the-art performance with substantially fewer parameters, yielding notable improvements in action recognition accuracy.
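For readers unfamiliar with selective SSMs, the unidirectional scan at the core of such models can be sketched as a simple input-dependent recurrence. This is a minimal, illustrative NumPy version (not the paper's implementation): a diagonal state matrix `A`, per-step discretization `dt`, and input-dependent projections `B` and `C` are all assumed shapes chosen for clarity.

```python
import numpy as np

# Minimal sketch of a selective state space recurrence (S6-style):
#   h_t = exp(dt_t * A) * h_{t-1} + dt_t * B_t * x_t
#   y_t = C_t . h_t
# All shapes and the input-dependent dt/B/C are illustrative assumptions,
# not the UST-SSM implementation.

rng = np.random.default_rng(1)
L, D = 16, 8                                    # sequence length, state size
x = rng.standard_normal(L)                      # one input channel
A = -np.abs(rng.standard_normal(D))             # stable diagonal state matrix
dt = np.log1p(np.exp(rng.standard_normal(L)))   # input-dependent step (softplus)
B = rng.standard_normal((L, D))                 # input-dependent input projection
C = rng.standard_normal((L, D))                 # input-dependent output projection

h = np.zeros(D)
y = np.empty(L)
for t in range(L):                              # unidirectional scan: O(L)
    h = np.exp(dt[t] * A) * h + dt[t] * B[t] * x[t]
    y[t] = C[t] @ h
```

The linear cost of this scan is what makes SSMs attractive for long point cloud sequences, but the recurrence is strictly causal in the scan order, which is why the ordering of the flattened sequence matters.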
📝 Abstract
Point cloud videos capture dynamic 3D motion while reducing the effects of lighting and viewpoint variations, making them highly effective for recognizing subtle and continuous human actions. Although Selective State Space Models (SSMs) have shown good performance in sequence modeling with linear complexity, the spatio-temporal disorder of point cloud videos hinders their unidirectional modeling when the point cloud video is directly unfolded into a 1D sequence through temporally sequential scanning. To address this challenge, we propose the Unified Spatio-Temporal State Space Model (UST-SSM), which extends the latest advancements in SSMs to point cloud videos. Specifically, we introduce Spatial-Temporal Selection Scanning (STSS), which reorganizes unordered points into semantic-aware sequences through prompt-guided clustering, thereby enabling the effective utilization of points that are spatially and temporally distant yet similar within the sequence. To recover missing 4D geometric and motion details, Spatio-Temporal Structure Aggregation (STSA) aggregates spatio-temporal features and compensates for the information lost during scanning. To improve temporal interaction within the sampled sequence, Temporal Interaction Sampling (TIS) enhances fine-grained temporal dependencies through non-anchor frame utilization and expanded receptive fields. Experimental results on the MSR-Action3D, NTU RGB+D, and Synthia 4D datasets validate the effectiveness of our method. Our code is available at https://github.com/wangzy01/UST-SSM.
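The reordering idea behind STSS can be illustrated with a toy sketch: cluster the 4D points, then sort them by (cluster, time) so that spatially similar points become adjacent in the flattened 1D sequence before the causal scan. The plain k-means-style assignment below is a stand-in for the paper's prompt-guided clustering; all shapes and names are illustrative assumptions.

```python
import numpy as np

# Toy sketch: turn an unordered point cloud video into a semantically
# grouped 1D sequence before a unidirectional SSM scan. Plain k-means
# stands in for prompt-guided clustering; this is NOT the UST-SSM code.

rng = np.random.default_rng(0)
T, N, K = 4, 32, 3                       # frames, points per frame, clusters
pts = rng.standard_normal((T, N, 3))     # (t, n, xyz): unordered within frames

t_idx = np.repeat(np.arange(T), N)       # frame index of each flattened point
flat = pts.reshape(-1, 3)

# Simple k-means assignment to K centroids seeded from the data
centroids = flat[rng.choice(len(flat), K, replace=False)]
for _ in range(10):
    d = np.linalg.norm(flat[:, None] - centroids[None], axis=-1)
    labels = d.argmin(axis=1)
    centroids = np.stack([
        flat[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(K)
    ])

# Order points by (cluster, then time): spatially similar points become
# adjacent in the 1D sequence, so a causal scan can relate points that
# are far apart in the raw spatio-temporal layout.
order = np.lexsort((t_idx, labels))
sequence = flat[order]                   # (T*N, 3) sequence fed to the scan
```

Sorting by cluster first and time second keeps each semantic group contiguous while preserving temporal order within it, which is the property a unidirectional scan needs to exploit long-range geometric similarity.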