Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization
By: Qiushuo Cheng, Jingjing Liu, Catherine Morgan, et al.
The self-supervised pretraining paradigm has achieved great success in learning 3D action representations for skeleton-based action recognition using contrastive learning. However, learning effective representations for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level action recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames where labels change. To this end, we formulate a snippet discrimination pretext task for self-supervised pretraining, which densely partitions skeleton sequences into non-overlapping snippets and, via contrastive learning, promotes features that distinguish snippets across videos. Additionally, we build on strong skeleton-based action recognition backbones by fusing their intermediate features with a U-shaped module, enhancing feature resolution for frame-level localization. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. We also achieve state-of-the-art transfer learning performance on PKU-MMD with pretraining on NTU RGB+D and BABEL.
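To make the pretext task concrete, here is a minimal PyTorch sketch of a snippet-level InfoNCE loss in the spirit of the snippet discrimination described above. This is not the authors' implementation: the function name `snippet_nce_loss`, the `snippet_len` and `temperature` values, and the use of average pooling over frames are all assumptions, and a full setup would also draw negatives from snippets of other sequences in the batch rather than only from the same pair of views.

```python
# Hypothetical sketch of snippet-level contrastive learning; the authors'
# actual loss and sampling scheme may differ.
import torch
import torch.nn.functional as F

def snippet_nce_loss(feats_a, feats_b, snippet_len=16, temperature=0.07):
    """InfoNCE over non-overlapping snippets of two augmented views.

    feats_a, feats_b: (T, C) per-frame features of the same skeleton
    sequence under two augmentations, from any skeleton backbone that
    returns frame-level features. Each sequence is densely split into
    non-overlapping snippets; a snippet in view A is positive with its
    temporally aligned snippet in view B and negative with all others.
    """
    T, C = feats_a.shape
    n = T // snippet_len  # number of whole snippets
    # Average-pool frames within each snippet -> (n, C) snippet embeddings.
    za = feats_a[: n * snippet_len].reshape(n, snippet_len, C).mean(1)
    zb = feats_b[: n * snippet_len].reshape(n, snippet_len, C).mean(1)
    za, zb = F.normalize(za, dim=1), F.normalize(zb, dim=1)
    logits = za @ zb.t() / temperature           # (n, n) similarity matrix
    targets = torch.arange(n, device=za.device)  # aligned snippets are positives
    return F.cross_entropy(logits, targets)
```

Likewise, the U-shaped multiscale fusion can be sketched as a small decoder that upsamples coarse backbone features and merges them with finer ones to recover frame-level temporal resolution. Everything below (class name, three stages, channel widths, linear upsampling) is illustrative, not the paper's architecture.

```python
# Hypothetical U-shaped fusion over multiscale backbone features.
import torch.nn as nn
import torch.nn.functional as F

class UFusion(nn.Module):
    def __init__(self, channels=(64, 128, 256), out_dim=128):
        super().__init__()
        # 1x1 convs project each backbone stage to a common width.
        self.lateral = nn.ModuleList([nn.Conv1d(c, out_dim, 1) for c in channels])
        self.smooth = nn.Conv1d(out_dim, out_dim, 3, padding=1)

    def forward(self, feats):
        """feats: list of (B, C_i, T_i) intermediate features, coarsest last."""
        x = self.lateral[-1](feats[-1])
        # Decoder path: upsample coarse features and add finer laterals,
        # restoring per-frame resolution for localization heads.
        for lat, f in zip(reversed(self.lateral[:-1]), reversed(feats[:-1])):
            x = F.interpolate(x, size=f.shape[-1], mode="linear",
                              align_corners=False)
            x = x + lat(f)
        return self.smooth(x)  # (B, out_dim, T) frame-level features
```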
Similar Papers
MS-CLR: Multi-Skeleton Contrastive Learning for Human Action Recognition
CV and Pattern Recognition
Teaches computers to understand actions from different body poses.
Skeletons Speak Louder than Text: A Motion-Aware Pretraining Paradigm for Video-Based Person Re-Identification
CV and Pattern Recognition
Helps computers recognize people in videos by their movement.
Label-Efficient Skeleton-based Recognition with Stable-Invertible Graph Convolutional Networks
CV and Pattern Recognition
Teaches computers to recognize actions with less data.