MobiDiary: Autoregressive Action Captioning with Wearable Devices and Wireless Signals
By: Fei Deng, Yinghui He, Chuntong Chu, and more
Human Activity Recognition (HAR) in smart homes is critical for health monitoring and assistive living. While vision-based systems are common, they face privacy concerns and environmental limitations (e.g., occlusion). In this work, we present MobiDiary, a framework that generates natural language descriptions of daily activities directly from heterogeneous physical signals (specifically IMU and Wi-Fi). Unlike conventional approaches that restrict outputs to predefined labels, MobiDiary produces expressive, human-readable summaries. To bridge the semantic gap between continuous, noisy physical signals and discrete linguistic descriptions, we propose a unified sensor encoder. Instead of relying on modality-specific engineering, we exploit the shared inductive biases of motion-induced signals: both inertial and wireless data reflect the underlying kinematic dynamics. Specifically, our encoder uses a patch-based mechanism to capture local temporal correlations and integrates heterogeneous placement embeddings to unify spatial contexts across different sensors. These unified signal tokens are then fed into a Transformer-based decoder, which generates coherent action descriptions autoregressively, word by word. We comprehensively evaluate our approach on multiple public benchmarks (XRF V2, UWash, and WiFiTAD). Experimental results demonstrate that MobiDiary generalizes effectively across modalities, achieving state-of-the-art performance on captioning metrics (e.g., BLEU@4, CIDEr, RMC) and outperforming specialized baselines in continuous action understanding.
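To make the abstract's pipeline concrete, below is a minimal sketch (not the authors' released code) of the encode-then-decode structure it describes: sensor streams are split into patches, tagged with a placement embedding, and fed as tokens to an autoregressive Transformer decoder. All names and hyperparameters here (PatchSensorEncoder, patch_len, num_placements, the toy shapes) are illustrative assumptions, not details from the paper.

```python
# Hedged sketch of a patch-based sensor encoder + autoregressive caption decoder.
# Assumed, illustrative module/parameter names; not MobiDiary's actual implementation.
import torch
import torch.nn as nn


class PatchSensorEncoder(nn.Module):
    """Turns a raw sensor stream into unified signal tokens."""

    def __init__(self, in_channels, patch_len, d_model, num_placements):
        super().__init__()
        # Patch-based projection: each window of `patch_len` samples becomes
        # one token, capturing local temporal correlations.
        self.patch_proj = nn.Conv1d(in_channels, d_model,
                                    kernel_size=patch_len, stride=patch_len)
        # Placement embedding: unifies spatial context across heterogeneous
        # sensors (e.g., a wrist-worn IMU vs. a Wi-Fi link).
        self.placement_emb = nn.Embedding(num_placements, d_model)

    def forward(self, x, placement_id):
        # x: (batch, channels, time), placement_id: (batch,)
        tokens = self.patch_proj(x).transpose(1, 2)   # (batch, patches, d_model)
        return tokens + self.placement_emb(placement_id).unsqueeze(1)


class CaptionDecoder(nn.Module):
    """Autoregressive Transformer decoder over the unified signal tokens."""

    def __init__(self, vocab_size, d_model, num_layers=4, num_heads=8):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, signal_tokens, caption_ids):
        tgt = self.word_emb(caption_ids)
        # Causal mask so each position only attends to earlier words.
        T = tgt.size(1)
        mask = torch.triu(torch.full((T, T), float("-inf"), device=tgt.device),
                          diagonal=1)
        hidden = self.decoder(tgt, signal_tokens, tgt_mask=mask)
        return self.lm_head(hidden)                   # next-word logits


if __name__ == "__main__":
    # Toy example: a 6-axis IMU stream of 512 samples, captioned word by word.
    encoder = PatchSensorEncoder(in_channels=6, patch_len=16,
                                 d_model=128, num_placements=8)
    decoder = CaptionDecoder(vocab_size=1000, d_model=128)
    imu = torch.randn(2, 6, 512)
    placement = torch.tensor([0, 3])           # e.g., left wrist, Wi-Fi link
    caption = torch.randint(0, 1000, (2, 12))  # teacher-forced caption prefix
    logits = decoder(encoder(imu, placement), caption)
    print(logits.shape)                        # torch.Size([2, 12, 1000])
```

At inference time, the same decoder would be run step by step, feeding each predicted word back in until an end-of-caption token is produced, which is what "autoregressive, word-by-word generation" amounts to in practice.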
Similar Papers
MotionTeller: Multi-modal Integration of Wearable Time-Series with LLMs for Health and Behavioral Understanding
Machine Learning (CS)
Turns movement data into easy-to-read stories.
Human Motion Estimation with Everyday Wearables
CV and Pattern Recognition
Tracks your body movements using your phone.
Towards Generalizable Human Activity Recognition: A Survey
Signal Processing
Helps smartwatches understand your movements better.