UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling
By: Peiming Li, Ziyi Wang, Yulin Yuan and more
Potential Business Impact:
Helps computers understand 3D movement from messy data.
Point cloud videos capture dynamic 3D motion while reducing the effects of lighting and viewpoint variations, making them highly effective for recognizing subtle and continuous human actions. Although Selective State Space Models (SSMs) have shown good performance in sequence modeling with linear complexity, the spatio-temporal disorder of point cloud videos hinders their unidirectional modeling when the video is directly unfolded into a 1D sequence through temporally sequential scanning. To address this challenge, we propose the Unified Spatio-Temporal State Space Model (UST-SSM), which extends the latest advancements in SSMs to point cloud videos. Specifically, we introduce Spatial-Temporal Selection Scanning (STSS), which reorganizes unordered points into semantic-aware sequences through prompt-guided clustering, so that points that are spatially and temporally distant yet similar can be used effectively within the sequence. To compensate for missing 4D geometric and motion details, Spatio-Temporal Structure Aggregation (STSA) aggregates spatio-temporal features. To improve temporal interaction within the sampled sequence, Temporal Interaction Sampling (TIS) enhances fine-grained temporal dependencies through non-anchor frame utilization and expanded receptive fields. Experimental results on the MSR-Action3D, NTU RGB+D, and Synthia 4D datasets validate the effectiveness of our method. Our code is available at https://github.com/wangzy01/UST-SSM.
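The contrast the abstract draws between temporally sequential scanning and a clustering-based reordering can be illustrated with a toy sketch. This is not the authors' STSS implementation: the point layout, the `feat` vectors, the fixed `prompts`, and both function names are assumptions made up for illustration, and nearest-prompt assignment by Euclidean distance stands in for whatever prompt-guided clustering the paper actually uses.

```python
from math import dist

def temporal_scan(points):
    """Baseline: unfold the video frame by frame into a 1D order of point indices."""
    # sorted() is stable, so points keep their original order within a frame.
    return sorted(range(len(points)), key=lambda i: points[i]["t"])

def prompt_guided_scan(points, prompts):
    """Hypothetical stand-in: group points by nearest prompt, then order by time."""
    def nearest(i):
        feat = points[i]["feat"]
        return min(range(len(prompts)), key=lambda k: dist(feat, prompts[k]))
    return sorted(range(len(points)), key=lambda i: (nearest(i), points[i]["t"]))

# Toy data (assumed): two kinds of points, interleaved across three frames.
points = [
    {"t": 0, "feat": (0.0, 0.1)},
    {"t": 0, "feat": (1.0, 0.9)},
    {"t": 1, "feat": (0.9, 1.0)},
    {"t": 1, "feat": (0.1, 0.0)},
    {"t": 2, "feat": (0.0, 0.2)},
    {"t": 2, "feat": (1.0, 1.0)},
]
prompts = [(0.0, 0.0), (1.0, 1.0)]  # assumed fixed; learned in a real model

temporal_order = temporal_scan(points)
clustered_order = prompt_guided_scan(points, prompts)
print(temporal_order)   # [0, 1, 2, 3, 4, 5]
print(clustered_order)  # [0, 3, 4, 1, 2, 5]
```

In the temporal order, similar points from different frames are separated by dissimilar ones; in the clustered order they sit next to each other, which is the property a unidirectional sequence model can exploit.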
Similar Papers
USTM: Unified Spatial and Temporal Modeling for Continuous Sign Language Recognition
CV and Pattern Recognition
Helps computers understand sign language from videos.
Spatio-Temporal State Space Model For Efficient Event-Based Optical Flow
CV and Pattern Recognition
Makes robots see fast and move smoothly.
MV-SSM: Multi-View State Space Modeling for 3D Human Pose Estimation
CV and Pattern Recognition
Helps computers see people's bodies from many cameras.