Score: 2

A Large-Scale Multimodal Dataset and Benchmarks for Human Activity Scene Understanding and Reasoning

Published: December 8, 2025 | arXiv ID: 2512.07136v1

By: Siyang Jiang, Mu Yuan, Xiang Ji and more

Potential Business Impact:

Helps computers understand actions from many sensors.

Business Areas:

Image Recognition, Data and Analytics, Software

Multimodal human action recognition (HAR) leverages complementary sensors for activity classification. Beyond recognition, recent advances in large language models (LLMs) enable detailed descriptions and causal reasoning, motivating two new tasks: human action understanding (HAU) and human action reasoning (HARn). However, most LLMs, especially large vision language models (LVLMs), struggle with non-RGB modalities such as depth, IMU, and mmWave because large-scale data-caption resources are lacking. Existing HAR datasets mainly provide coarse data-label annotations, which are insufficient to capture the fine-grained action dynamics needed for HAU and HARn. We consider two types of ground-truth pairs: (1) data-label pairs, where the ground truth is a discrete category, and (2) data-caption pairs, where it is a textual description. Captions generated naively from labels often lack logical and spatiotemporal consistency. We introduce CUHK-X, a large-scale multimodal dataset and benchmark suite for HAR, HAU, and HARn. CUHK-X contains 58,445 samples covering 40 actions performed by 30 participants across two indoor environments. To improve caption consistency, we propose a prompt-based scene creation method that uses LLMs to generate logically connected activity sequences, which are then validated by humans. CUHK-X includes three benchmarks with six evaluation tasks. Experiments report average accuracies of 76.52% (HAR), 40.76% (HAU), and 70.25% (HARn). CUHK-X aims to enable the community to apply and develop data-intensive learning methods for robust, multimodal human activity analysis. Project page: https://openaiotlab.github.io/CUHK-X/. Code: https://github.com/openaiotlab/CUHK-X.
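The two ground-truth pair types named in the abstract can be pictured as two small record types. The following is only an illustrative sketch: the class names, field names, and modality strings are assumptions made for this example and are not taken from the CUHK-X code or data format.

```python
from dataclasses import dataclass
from typing import Sequence, Union

# Hypothetical modality names, based on the examples in the abstract.
# The actual CUHK-X release may name or organize modalities differently.
MODALITIES = ("rgb", "depth", "imu", "mmwave")


@dataclass(frozen=True)
class LabeledSample:
    """Data-label pair: sensor data plus a discrete action category."""
    modality: str
    data: Sequence[float]
    label: int  # index into a fixed list of action classes


@dataclass(frozen=True)
class CaptionedSample:
    """Data-caption pair: sensor data plus a free-text description."""
    modality: str
    data: Sequence[float]
    caption: str


Sample = Union[LabeledSample, CaptionedSample]


def ground_truth_kind(sample: Sample) -> str:
    """Return which kind of ground truth a sample carries."""
    if isinstance(sample, LabeledSample):
        return "label"
    if isinstance(sample, CaptionedSample):
        return "caption"
    raise TypeError(f"unsupported sample type: {type(sample).__name__}")


if __name__ == "__main__":
    labeled = LabeledSample(modality="imu", data=[0.1, 0.2, 0.3], label=3)
    captioned = CaptionedSample(
        modality="depth",
        data=[0.0, 0.5],
        caption="A person stands up from a chair and walks to the door.",
    )
    print(ground_truth_kind(labeled), ground_truth_kind(captioned))
```

In this sketch a classification task would consume `LabeledSample` records, while description and reasoning tasks would consume `CaptionedSample` records. How the dataset actually stores either kind is not specified in the abstract.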

Towards Generalizable Human Activity Recognition: A Survey

Signal Processing

Helps smartwatches understand your movements better.

17 Aug 2025 3

90%

Scaling Human Activity Recognition: A Comparative Evaluation of Synthetic Data Generation and Augmentation Techniques

CV and Pattern Recognition

Creates fake motion data to train activity trackers.

9 Jun 2025 0

89%

A Novel Deep Hybrid Framework with Ensemble-Based Feature Optimization for Robust Real-Time Human Activity Recognition

CV and Pattern Recognition

Helps computers understand what people are doing.

26 Aug 2025 0


Repos / Data Links

github.com

Page Count

19 pages
