UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling

📅 2025-08-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Point cloud videos reduce interference from illumination and viewpoint variations, but their spatio-temporal disorder hinders unidirectional sequence modeling. To address this, we propose the Unified Spatio-Temporal State Space Model (UST-SSM), which enables ordered modeling of disordered point cloud sequences through three components: (1) Spatial-Temporal Selection Scanning (STSS), a semantic-aware scanning strategy based on prompt-guided clustering; (2) Spatio-Temporal Structure Aggregation (STSA), a structured feature aggregation mechanism; and (3) Temporal Interaction Sampling (TIS), which draws on non-anchor frames for temporal interaction. UST-SSM captures long-range geometric similarities and fine-grained motion dependencies, and combining prompt-guided clustering with 4D spatio-temporal feature aggregation improves the joint representation of geometric structure and motion patterns. Evaluated on the MSR-Action3D, NTU RGB+D, and Synthia 4D benchmarks, our method achieves state-of-the-art performance with substantially fewer parameters, yielding notable improvements in action recognition accuracy.

📝 Abstract
Point cloud videos capture dynamic 3D motion while reducing the effects of lighting and viewpoint variations, making them highly effective for recognizing subtle and continuous human actions. Although Selective State Space Models (SSMs) have shown good performance in sequence modeling with linear complexity, the spatio-temporal disorder of point cloud videos hinders their unidirectional modeling when the point cloud video is directly unfolded into a 1D sequence through temporally sequential scanning. To address this challenge, we propose the Unified Spatio-Temporal State Space Model (UST-SSM), which extends the latest advancements in SSMs to point cloud videos. Specifically, we introduce Spatial-Temporal Selection Scanning (STSS), which reorganizes unordered points into semantic-aware sequences through prompt-guided clustering, thereby enabling the effective utilization of points that are spatially and temporally distant yet similar within the sequence. To compensate for missing 4D geometric and motion details, Spatio-Temporal Structure Aggregation (STSA) aggregates spatio-temporal features. To improve temporal interaction within the sampled sequence, Temporal Interaction Sampling (TIS) enhances fine-grained temporal dependencies through non-anchor frame utilization and expanded receptive fields. Experimental results on the MSR-Action3D, NTU RGB+D, and Synthia 4D datasets validate the effectiveness of our method. Our code is available at https://github.com/wangzy01/UST-SSM.
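The difference between temporally sequential scanning and a semantic-aware ordering can be illustrated with a toy sketch. This is not the authors' STSS implementation: the function names, the 2-D features, the fixed prompt vectors, and the nearest-prompt assignment by cosine similarity are all assumptions made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def temporal_scan(points):
    """Baseline: unfold the video frame by frame (order by frame index only)."""
    return sorted(range(len(points)), key=lambda i: points[i][0])

def semantic_scan(points, prompts):
    """Assumed stand-in for prompt-guided clustering: group each point with
    its most similar prompt, then order by frame index within each group."""
    def cluster(i):
        feat = points[i][1]
        return max(range(len(prompts)), key=lambda k: cosine(feat, prompts[k]))
    return sorted(range(len(points)), key=lambda i: (cluster(i), points[i][0]))

# Toy data: (frame index, 2-D feature). The features are hypothetical.
points = [
    (0, (1.0, 0.1)),  # frame 0, close to prompt 0
    (0, (0.1, 1.0)),  # frame 0, close to prompt 1
    (1, (0.9, 0.2)),  # frame 1, close to prompt 0
    (1, (0.2, 0.9)),  # frame 1, close to prompt 1
]
prompts = [(1.0, 0.0), (0.0, 1.0)]

print(temporal_scan(points))           # [0, 1, 2, 3]
print(semantic_scan(points, prompts))  # [0, 2, 1, 3]
```

In the semantic ordering, points 0 and 2 sit next to each other even though they come from different frames, which is the kind of adjacency between distant yet similar points that the abstract describes.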
Problem

Research questions and friction points this paper is trying to address.

Modeling spatiotemporal disorder in point cloud videos
Capturing 4D geometric and motion details effectively
Enhancing temporal interactions within sampled sequences
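To see why ordering matters for a unidirectional model, consider a minimal sketch of a scalar linear recurrence. It is a deliberately simplified, non-selective stand-in for an SSM, not the model used in the paper; `linear_scan` and the constants `a` and `b` are hypothetical.

```python
def linear_scan(xs, a=0.5, b=1.0):
    """Run h_t = a * h_{t-1} + b * x_t in a single left-to-right pass.

    One update per token gives cost linear in sequence length, and each
    state depends only on tokens seen so far (unidirectional).
    """
    h = 0.0
    states = []
    for x in xs:
        h = a * h + b * x
        states.append(h)
    return states

# The same values in two different orders.
ordered = [1.0, 1.0, 0.0, 0.0]
shuffled = [1.0, 0.0, 0.0, 1.0]

print(linear_scan(ordered))   # [1.0, 1.5, 0.75, 0.375]
print(linear_scan(shuffled))  # [1.0, 0.5, 0.25, 1.125]
```

When the two related inputs are adjacent, the state accumulates both (1.5). When they are separated, the first has decayed to 0.125 by the time the second arrives, so the recurrence barely relates them.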
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial-Temporal Selection Scanning organizes points into semantic sequences
Spatio-Temporal Structure Aggregation compensates for missing geometric and motion details
Temporal Interaction Sampling enhances fine-grained temporal dependencies
Authors
Peiming Li
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Ziyi Wang
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Yulin Yuan
The Zhejiang University-University of Illinois Urbana-Champaign Institute, Zhejiang University
Hong Liu
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Xiangming Meng
The Zhejiang University-University of Illinois Urbana-Champaign Institute, Zhejiang University
machine learning, signal processing, Bayesian inference
Junsong Yuan
State University of New York at Buffalo
computer vision, video analytics, action and gesture analysis, multimedia, pattern recognition
Mengyuan Liu
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School