Multimodal Fusion and Interpretability in Human Activity Recognition: A Reproducible Framework for Sensor-Based Modeling
By: Yiyao Yang, Yasemin Gulbahar
Potential Business Impact:
Helps computers understand people by combining senses.
The research presents a comprehensive framework for consolidating multimodal sensor data collected under naturalistic conditions, grounded in the Carnegie Mellon University Multi-Modal Activity Database (CMU-MMAC). Focusing on the Subject 07-Brownie recording, the study covers the entire processing pipeline, from data alignment and transformation to fusion method evaluation, interpretability analysis, and assessment of each modality's contribution. A unified preprocessing pipeline temporally aligns the heterogeneous video and audio streams through resampling, grayscale conversion, segmentation, and feature standardization before fusion. Semantic richness is confirmed via heatmaps, spectrograms, and luminance time series, while frame-aligned waveform overlays demonstrate temporal consistency. Results indicate that late fusion yields the highest validation accuracy, followed by hybrid fusion, with early fusion performing worst. To assess the interpretability and discriminative power of audio and video in fused activity recognition, PCA and t-SNE are used to visualize feature coherence over time. Classification results show limited performance for audio alone, moderate performance for video alone, and a significant improvement with multimodal fusion, underscoring the strength of combined data. Incorporating RFID data, which captures sparse object interactions asynchronously, further enhances recognition accuracy by over 50% and improves the macro-averaged ROC-AUC. Overall, the framework transforms raw, asynchronous sensor data into aligned, semantically meaningful representations, providing a reproducible approach to multimodal data integration and interpretation in intelligent systems designed to perceive complex human activities.
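As an illustration of the alignment step described above, the following minimal Python sketch (not the authors' code; the frame rate, sample rate, window size, and luminance weights are assumed for illustration) chops an audio stream into one window per video frame, converts RGB frames to grayscale, and standardizes simple per-frame features:

```python
# A minimal sketch of frame-aligned audio/video preprocessing.
# Assumptions (not from the paper): 30 fps video, 44.1 kHz audio,
# Rec. 601 luminance weights, toy mean-luminance and RMS features.
import numpy as np

FPS = 30                      # assumed video frame rate
SR = 44_100                   # assumed audio sample rate
SAMPLES_PER_FRAME = SR // FPS

def to_grayscale(frames: np.ndarray) -> np.ndarray:
    """Convert (T, H, W, 3) RGB frames to (T, H, W) luminance."""
    weights = np.array([0.299, 0.587, 0.114])  # standard Rec. 601 weights
    return frames @ weights

def align_audio_to_frames(audio: np.ndarray, n_frames: int) -> np.ndarray:
    """Chop a 1-D audio signal into one window per video frame."""
    needed = n_frames * SAMPLES_PER_FRAME
    audio = np.pad(audio, (0, max(0, needed - len(audio))))[:needed]
    return audio.reshape(n_frames, SAMPLES_PER_FRAME)

def standardize(x: np.ndarray) -> np.ndarray:
    """Zero-mean, unit-variance standardization per feature column."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

# Toy data: 90 frames (~3 s) of video and the matching audio stream.
video = np.random.rand(90, 64, 64, 3)
audio = np.random.randn(90 * SAMPLES_PER_FRAME)

gray = to_grayscale(video)                        # (90, 64, 64)
audio_windows = align_audio_to_frames(audio, 90)  # (90, 1470)

# Simple per-frame features: mean luminance and audio RMS energy.
video_feat = standardize(gray.reshape(90, -1).mean(axis=1, keepdims=True))
audio_feat = standardize(np.sqrt((audio_windows ** 2).mean(axis=1, keepdims=True)))
```

Once both modalities share the per-frame timeline, the frame-aligned waveform overlays mentioned in the abstract amount to plotting `audio_windows` statistics against the video's frame index.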
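The early-versus-late fusion contrast the abstract reports can be sketched as follows, with logistic regression standing in for whatever classifiers the study actually used and synthetic features standing in for the real modalities:

```python
# A minimal sketch contrasting early fusion (feature concatenation into one
# classifier) with late fusion (averaging per-modality class probabilities).
# The features, class count, and classifier choice are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, n_classes = 600, 5
y = rng.integers(0, n_classes, n)
# Synthetic per-modality features that each carry some class signal.
video_X = rng.normal(size=(n, 32)) + y[:, None] * 0.3
audio_X = rng.normal(size=(n, 16)) + y[:, None] * 0.2

idx_tr, idx_te = train_test_split(np.arange(n), test_size=0.3, random_state=0)

# Early fusion: concatenate modality features, train a single model.
X_all = np.hstack([video_X, audio_X])
early = LogisticRegression(max_iter=1000).fit(X_all[idx_tr], y[idx_tr])
early_acc = early.score(X_all[idx_te], y[idx_te])

# Late fusion: one model per modality, average predicted probabilities.
m_v = LogisticRegression(max_iter=1000).fit(video_X[idx_tr], y[idx_tr])
m_a = LogisticRegression(max_iter=1000).fit(audio_X[idx_tr], y[idx_tr])
proba = (m_v.predict_proba(video_X[idx_te]) + m_a.predict_proba(audio_X[idx_te])) / 2
late_acc = (proba.argmax(axis=1) == y[idx_te]).mean()

print(f"early fusion accuracy: {early_acc:.3f}")
print(f"late fusion accuracy:  {late_acc:.3f}")
```

A hybrid scheme would sit between these two, e.g. concatenating intermediate per-modality representations before a final classifier rather than fusing raw features or final probabilities.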
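The interpretability step can be approximated with scikit-learn's PCA and t-SNE applied to a standardized fused feature matrix; the fused features below are synthetic placeholders, not the paper's data:

```python
# A minimal sketch of the PCA / t-SNE interpretability step on fused,
# per-segment features. Segment counts and dimensionality are invented.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
labels = np.repeat(np.arange(5), 40)                  # 5 activities x 40 segments
fused = rng.normal(size=(200, 48)) + labels[:, None]  # stand-in fused features

pca = PCA(n_components=2)
pca_2d = pca.fit_transform(fused)                     # linear projection
print("explained variance:", pca.explained_variance_ratio_)

tsne_2d = TSNE(n_components=2, perplexity=30, random_state=1).fit_transform(fused)
```

Scatter-plotting `pca_2d` or `tsne_2d` colored by activity label (or by time index) is what reveals the feature coherence over time that the abstract refers to.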
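Finally, a hypothetical sketch of folding in the RFID modality: sparse, asynchronous tag reads are binned onto the shared frame timeline as indicator features that can then be fused with the audio/video streams. The tag names, timestamps, and smoothing kernel below are illustrative inventions, not the CMU-MMAC schema:

```python
# A minimal sketch of converting sparse, asynchronous RFID events into
# frame-aligned features. Event format and tag names are hypothetical.
import numpy as np

FPS = 30
N_FRAMES = 90
TAGS = ["bowl", "spoon", "pan"]  # hypothetical tag names

# Each event: (timestamp in seconds, tag index); timing is asynchronous.
events = [(0.10, 0), (0.12, 1), (1.75, 2), (2.40, 0)]

rfid_feat = np.zeros((N_FRAMES, len(TAGS)))
for t, tag in events:
    frame = min(int(t * FPS), N_FRAMES - 1)
    rfid_feat[frame, tag] = 1.0  # mark the frame in which the tag was read

# Optional smearing so a read also influences neighboring frames.
kernel = np.ones(5) / 5
rfid_feat = np.apply_along_axis(
    lambda col: np.convolve(col, kernel, mode="same"), 0, rfid_feat)
```

Because the reads are sparse and asynchronous, binning (rather than resampling) is the natural way to bring RFID onto the same timeline as the densely sampled modalities before late fusion.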
Similar Papers
Multimodal Foundation Model for Cross-Modal Retrieval and Activity Recognition Tasks
CV and Pattern Recognition
Lets computers understand full-body movements better.
Towards Adaptive Fusion of Multimodal Deep Networks for Human Action Recognition
CV and Pattern Recognition
Lets computers understand actions by watching, listening, and feeling.
A Novel Deep Hybrid Framework with Ensemble-Based Feature Optimization for Robust Real-Time Human Activity Recognition
CV and Pattern Recognition
Helps computers understand what people are doing.