Score: 1

MaskSem: Semantic-Guided Masking for Learning 3D Hybrid High-Order Motion Representation

Published: August 18, 2025 | arXiv ID: 2508.12948v1

By: Wei Wei , Shaojie Zhang , Yonghao Dang and more

Potential Business Impact:

Helps robots understand human movements better.

Human action recognition is a crucial task for intelligent robotics, particularly within the context of human-robot collaboration research. In self-supervised skeleton-based action recognition, the mask-based reconstruction paradigm learns the spatial structure and motion patterns of the skeleton by masking joints and reconstructing the target from unlabeled data. However, existing methods focus on a limited set of joints and low-order motion patterns, limiting the model's ability to understand complex motion patterns. To address this issue, we introduce MaskSem, a novel semantic-guided masking method for learning 3D hybrid high-order motion representations. This novel framework leverages Grad-CAM based on relative motion to guide the masking of joints, which can be represented as the most semantically rich temporal orgions. The semantic-guided masking process can encourage the model to explore more discriminative features. Furthermore, we propose using hybrid high-order motion as the reconstruction target, enabling the model to learn multi-order motion patterns. Specifically, low-order motion velocity and high-order motion acceleration are used together as the reconstruction target. This approach offers a more comprehensive description of the dynamic motion process, enhancing the model's understanding of motion patterns. Experiments on the NTU60, NTU120, and PKU-MMD datasets show that MaskSem, combined with a vanilla transformer, improves skeleton-based action recognition, making it more suitable for applications in human-robot interaction.

Informative Sample Selection Model for Skeleton-based Action Recognition with Limited Training Samples

CV and Pattern Recognition

Teaches computers to recognize actions with less data.

29 Oct 2025 0

89%

Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition

CV and Pattern Recognition

Teaches robots to understand actions by watching.

6 Nov 2025 2

88%

Heterogeneous Skeleton-Based Action Representation Learning

CV and Pattern Recognition

Helps robots understand actions from different body shapes.

4 Jun 2025 0

View PDF Login to Bookmark

Country of Origin

🇨🇳 China

Repos / Data Links

github.com

Page Count

8 pages

MaskSem: Semantic-Guided Masking for Learning 3D Hybrid High-Order Motion Representation

Helps robots understand human movements better.

Technical Abstract

Informative Sample Selection Model for Skeleton-based Action Recognition with Limited Training Samples

Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition

Heterogeneous Skeleton-Based Action Representation Learning