Patch as Node: Human-Centric Graph Representation Learning for Multimodal Action Recognition
By: Zeyu Liang, Hailun Xia, Naichuan Zheng
Potential Business Impact:
Makes computers understand actions from videos better.
While human action recognition has witnessed notable achievements, multimodal methods fusing RGB and skeleton modalities still suffer from their inherent heterogeneity and fail to fully exploit the complementary potential between them. In this paper, we propose PAN, the first human-centric graph representation learning framework for multimodal action recognition, in which token embeddings of RGB patches containing human joints are represented as spatiotemporal graphs. The human-centric graph modeling paradigm suppresses the redundancy in RGB frames and aligns well with skeleton-based methods, thus enabling a more effective and semantically coherent fusion of multimodal features. Since the sampling of token embeddings heavily relies on 2D skeletal data, we further propose attention-based post calibration to reduce the dependency on high-quality skeletal data at a minimal cost in terms of model performance. To explore the potential of PAN in integrating with skeleton-based methods, we present two variants: PAN-Ensemble, which employs dual-path graph convolutional networks followed by late fusion, and PAN-Unified, which performs unified graph representation learning within a single network. On three widely used multimodal action recognition datasets, both PAN-Ensemble and PAN-Unified achieve state-of-the-art (SOTA) performance in their respective settings of multimodal fusion: separate and unified modeling, respectively.
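The core "patch as node" idea described above — gathering the RGB patch tokens that contain human joints so they can serve as graph nodes alongside the skeleton — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the NumPy token grid, the patch size, and the joint count are all assumptions for the sake of the example.

```python
import numpy as np

def sample_joint_patch_tokens(patch_tokens, joints_2d, patch_size):
    """Gather RGB patch tokens at 2D joint locations (patch-as-node sampling).

    patch_tokens: (T, H, W, C) grid of patch token embeddings per frame
    joints_2d:    (T, J, 2) joint pixel coordinates as (x, y)
    Returns node features of shape (T, J, C), one node per joint per frame.
    """
    T, H, W, C = patch_tokens.shape
    # Map pixel coordinates to patch-grid indices, clamped to the grid bounds.
    cols = np.clip((joints_2d[..., 0] // patch_size).astype(int), 0, W - 1)
    rows = np.clip((joints_2d[..., 1] // patch_size).astype(int), 0, H - 1)
    t_idx = np.arange(T)[:, None]  # broadcast frame index over all joints
    return patch_tokens[t_idx, rows, cols]  # (T, J, C)

# Toy example: 4 frames, a 14x14 token grid (224px frame / 16px patches),
# and 17 joints per frame (a common 2D pose layout).
tokens = np.random.randn(4, 14, 14, 64)
joints = np.random.rand(4, 17, 2) * 224
nodes = sample_joint_patch_tokens(tokens, joints, patch_size=16)
print(nodes.shape)  # (4, 17, 64)
```

The resulting (frames × joints × channels) tensor has the same node layout as a skeleton sequence, which is what lets the sampled RGB tokens share a spatiotemporal graph structure with skeleton features in the fusion stage.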
Similar Papers
Towards Adaptive Fusion of Multimodal Deep Networks for Human Action Recognition
CV and Pattern Recognition
Lets computers understand actions by watching, listening, and feeling.
Multimodal Skeleton-Based Action Representation Learning via Decomposition and Composition
CV and Pattern Recognition
Teaches computers to understand actions from different senses.
SNN-Driven Multimodal Human Action Recognition via Event Camera and Skeleton Data Fusion
CV and Pattern Recognition
Lets computers understand actions using less power.