Patch as Node: Human-Centric Graph Representation Learning for Multimodal Action Recognition
By: Zeyu Liang, Hailun Xia, Naichuan Zheng
Potential Business Impact:
Makes computers understand actions from videos better.
While human action recognition has witnessed notable achievements, multimodal methods fusing RGB and skeleton modalities still suffer from their inherent heterogeneity and fail to fully exploit the complementary potential between them. In this paper, we propose PAN, the first human-centric graph representation learning framework for multimodal action recognition, in which token embeddings of RGB patches containing human joints are represented as spatiotemporal graphs. The human-centric graph modeling paradigm suppresses the redundancy in RGB frames and aligns well with skeleton-based methods, thus enabling a more effective and semantically coherent fusion of multimodal features. Since the sampling of token embeddings heavily relies on 2D skeletal data, we further propose attention-based post calibration to reduce the dependency on high-quality skeletal data at a minimal cost in terms of model performance. To explore the potential of PAN in integrating with skeleton-based methods, we present two variants: PAN-Ensemble, which employs dual-path graph convolutional networks followed by late fusion, and PAN-Unified, which performs unified graph representation learning within a single network. On three widely used multimodal action recognition datasets, both PAN-Ensemble and PAN-Unified achieve state-of-the-art (SOTA) performance in their respective settings of multimodal fusion: separate and unified modeling, respectively.
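The core "patch as node" idea described above — gathering the RGB patch tokens that contain human joints so they can serve as graph nodes alongside the skeleton — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the NumPy token grid, the patch size, and the joint count are all assumptions for the sake of the example.

```python
import numpy as np

def sample_joint_patch_tokens(patch_tokens, joints_2d, patch_size):
    """Gather RGB patch tokens at 2D joint locations (patch-as-node sampling).

    patch_tokens: (T, H, W, C) grid of patch token embeddings per frame
    joints_2d:    (T, J, 2) joint pixel coordinates as (x, y)
    Returns node features of shape (T, J, C), one node per joint per frame.
    """
    T, H, W, C = patch_tokens.shape
    # Map pixel coordinates to patch-grid indices, clamped to the grid bounds.
    cols = np.clip((joints_2d[..., 0] // patch_size).astype(int), 0, W - 1)
    rows = np.clip((joints_2d[..., 1] // patch_size).astype(int), 0, H - 1)
    t_idx = np.arange(T)[:, None]  # broadcast frame index over all joints
    return patch_tokens[t_idx, rows, cols]  # (T, J, C)

# Toy example: 4 frames, a 14x14 token grid (224px frame / 16px patches),
# and 17 joints per frame (a common 2D pose layout).
tokens = np.random.randn(4, 14, 14, 64)
joints = np.random.rand(4, 17, 2) * 224
nodes = sample_joint_patch_tokens(tokens, joints, patch_size=16)
print(nodes.shape)  # (4, 17, 64)
```

The resulting (frames × joints × channels) tensor has the same node layout as a skeleton sequence, which is what lets the sampled RGB tokens share a spatiotemporal graph structure with skeleton features in the fusion stage.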
Similar Papers
Towards Adaptive Fusion of Multimodal Deep Networks for Human Action Recognition
CV and Pattern Recognition
Lets computers understand actions by watching, listening, and feeling.
Multimodal Skeleton-Based Action Representation Learning via Decomposition and Composition
CV and Pattern Recognition
Teaches computers to understand actions from different senses.
SNN-Driven Multimodal Human Action Recognition via Event Camera and Skeleton Data Fusion
CV and Pattern Recognition
Lets computers understand actions using less power.