🤖 AI Summary
Point cloud videos mitigate interference from illumination and viewpoint variations but suffer from spatiotemporal disorder, which hinders unidirectional sequence modeling. To address this, we propose the Unified Spatio-Temporal State Space Model (UST-SSM), which enables ordered modeling of disordered point cloud sequences via three key components: (1) Spatial-Temporal Selection Scanning, a semantic-aware selective scanning strategy; (2) Spatio-Temporal Structure Aggregation, a structured feature aggregation mechanism; and (3) Temporal Interaction Sampling, which exploits non-anchor frames for temporal interaction. UST-SSM efficiently captures long-range geometric similarities and fine-grained motion dependencies. Furthermore, it integrates prompt-guided clustering with dynamic 4D spatiotemporal feature aggregation, strengthening the joint representation of geometric structure and motion patterns. Evaluated on the MSR-Action3D, NTU RGB+D, and Synthia 4D benchmarks, our method achieves state-of-the-art performance with substantially fewer parameters, yielding notable improvements in action recognition accuracy.
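For readers unfamiliar with selective SSMs, the unidirectional scan at the core of such models can be sketched as a simple input-dependent recurrence. This is a minimal, illustrative NumPy version (not the paper's implementation): a diagonal state matrix `A`, per-step discretization `dt`, and input-dependent projections `B` and `C` are all assumed shapes chosen for clarity.

```python
import numpy as np

# Minimal sketch of a selective state space recurrence (S6-style):
#   h_t = exp(dt_t * A) * h_{t-1} + dt_t * B_t * x_t
#   y_t = C_t . h_t
# All shapes and the input-dependent dt/B/C are illustrative assumptions,
# not the UST-SSM implementation.

rng = np.random.default_rng(1)
L, D = 16, 8                                    # sequence length, state size
x = rng.standard_normal(L)                      # one input channel
A = -np.abs(rng.standard_normal(D))             # stable diagonal state matrix
dt = np.log1p(np.exp(rng.standard_normal(L)))   # input-dependent step (softplus)
B = rng.standard_normal((L, D))                 # input-dependent input projection
C = rng.standard_normal((L, D))                 # input-dependent output projection

h = np.zeros(D)
y = np.empty(L)
for t in range(L):                              # unidirectional scan: O(L)
    h = np.exp(dt[t] * A) * h + dt[t] * B[t] * x[t]
    y[t] = C[t] @ h
```

The linear cost of this scan is what makes SSMs attractive for long point cloud sequences, but the recurrence is strictly causal in the scan order, which is why the ordering of the flattened sequence matters.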
📝 Abstract
Point cloud videos capture dynamic 3D motion while reducing the effects of lighting and viewpoint variations, making them highly effective for recognizing subtle and continuous human actions. Although Selective State Space Models (SSMs) have shown good performance in sequence modeling with linear complexity, the spatio-temporal disorder of point cloud videos hinders their unidirectional modeling when the point cloud video is directly unfolded into a 1D sequence through temporally sequential scanning. To address this challenge, we propose the Unified Spatio-Temporal State Space Model (UST-SSM), which extends the latest advancements in SSMs to point cloud videos. Specifically, we introduce Spatial-Temporal Selection Scanning (STSS), which reorganizes unordered points into semantic-aware sequences through prompt-guided clustering, thereby enabling the effective utilization of points that are spatially and temporally distant yet similar within the sequence. To recover missing 4D geometric and motion details, Spatio-Temporal Structure Aggregation (STSA) aggregates spatio-temporal features and compensates for the information lost during scanning. To improve temporal interaction within the sampled sequence, Temporal Interaction Sampling (TIS) enhances fine-grained temporal dependencies through non-anchor frame utilization and expanded receptive fields. Experimental results on the MSR-Action3D, NTU RGB+D, and Synthia 4D datasets validate the effectiveness of our method. Our code is available at https://github.com/wangzy01/UST-SSM.
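The reordering idea behind STSS can be illustrated with a toy sketch: cluster the 4D points, then sort them by (cluster, time) so that spatially similar points become adjacent in the flattened 1D sequence before the causal scan. The plain k-means-style assignment below is a stand-in for the paper's prompt-guided clustering; all shapes and names are illustrative assumptions.

```python
import numpy as np

# Toy sketch: turn an unordered point cloud video into a semantically
# grouped 1D sequence before a unidirectional SSM scan. Plain k-means
# stands in for prompt-guided clustering; this is NOT the UST-SSM code.

rng = np.random.default_rng(0)
T, N, K = 4, 32, 3                       # frames, points per frame, clusters
pts = rng.standard_normal((T, N, 3))     # (t, n, xyz): unordered within frames

t_idx = np.repeat(np.arange(T), N)       # frame index of each flattened point
flat = pts.reshape(-1, 3)

# Simple k-means assignment to K centroids seeded from the data
centroids = flat[rng.choice(len(flat), K, replace=False)]
for _ in range(10):
    d = np.linalg.norm(flat[:, None] - centroids[None], axis=-1)
    labels = d.argmin(axis=1)
    centroids = np.stack([
        flat[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(K)
    ])

# Order points by (cluster, then time): spatially similar points become
# adjacent in the 1D sequence, so a causal scan can relate points that
# are far apart in the raw spatio-temporal layout.
order = np.lexsort((t_idx, labels))
sequence = flat[order]                   # (T*N, 3) sequence fed to the scan
```

Sorting by cluster first and time second keeps each semantic group contiguous while preserving temporal order within it, which is the property a unidirectional scan needs to exploit long-range geometric similarity.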