Using LLMs for Late Multimodal Sensor Fusion for Activity Recognition
By: Ilker Demirel, Karan Thakkar, Benjamin Elizalde, and more
Potential Business Impact:
Lets computers understand actions from sound and movement.
Sensor data streams provide valuable information about activities and context for downstream applications, though integrating complementary information can be challenging. We show that large language models (LLMs) can be used for late fusion in activity classification from audio and motion time series data. We curated a subset of the Ego4D dataset covering diverse activities across contexts (e.g., household activities, sports). The evaluated LLMs achieved 12-class zero- and one-shot classification F1-scores significantly above chance, with no task-specific training. Zero-shot classification via LLM-based fusion of modality-specific model outputs can enable multimodal temporal applications where there is limited aligned training data for learning a shared embedding space. Additionally, LLM-based fusion can enable model deployment without requiring additional memory and computation for targeted, application-specific multimodal models.
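The page does not include code, but the fusion idea in the abstract is easy to illustrate: each modality-specific model (audio, motion/IMU) emits class scores, those scores are serialized into a prompt, and an LLM picks a final label zero-shot. The sketch below is an assumption-laden illustration, not the authors' implementation; the label names, score format, and the `query_llm` hook are placeholders you would wire to your own models and LLM API.

```python
# Minimal sketch of LLM-based late fusion for zero-shot activity classification.
# Assumptions (not from the paper): the example labels, the probability format,
# and the query_llm placeholder are all illustrative.

def format_predictions(name: str, scores: dict[str, float]) -> str:
    """Serialize one modality model's class scores as ranked 'label: probability' lines."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    lines = [f"{label}: {prob:.2f}" for label, prob in ranked]
    return f"{name} model predictions (label: probability):\n" + "\n".join(lines)


def build_fusion_prompt(audio_scores: dict[str, float],
                        motion_scores: dict[str, float],
                        classes: list[str]) -> str:
    """Combine both modalities' outputs into a single instruction for the LLM."""
    return (
        "You are fusing predictions from two sensor models to recognize an activity.\n"
        f"Valid activity labels: {', '.join(classes)}.\n\n"
        f"{format_predictions('Audio', audio_scores)}\n\n"
        f"{format_predictions('Motion (IMU)', motion_scores)}\n\n"
        "Considering both modalities, answer with exactly one label from the list."
    )


def query_llm(prompt: str) -> str:
    """Placeholder: route the prompt to whichever LLM API you use and return its text reply."""
    raise NotImplementedError


def classify_zero_shot(audio_scores: dict[str, float],
                       motion_scores: dict[str, float],
                       classes: list[str]) -> str:
    """Late fusion: ask the LLM for a label; fall back to the strongest unimodal guess."""
    prompt = build_fusion_prompt(audio_scores, motion_scores, classes)
    reply = query_llm(prompt).strip().lower()
    if reply in classes:
        return reply
    # If the reply is not a valid label, fall back to the highest-confidence unimodal prediction.
    best_label, _ = max(list(audio_scores.items()) + list(motion_scores.items()),
                        key=lambda kv: kv[1])
    return best_label


if __name__ == "__main__":
    # Made-up scores for illustration only.
    audio = {"cooking": 0.55, "cleaning": 0.25, "walking": 0.20}
    motion = {"cleaning": 0.50, "cooking": 0.35, "walking": 0.15}
    print(build_fusion_prompt(audio, motion, classes=list(audio)))
    # classify_zero_shot(audio, motion, classes=list(audio))  # needs a real query_llm
```

Because the fusion happens in the prompt rather than in a jointly trained model, no aligned multimodal training data or shared embedding space is needed, which is the deployment advantage the abstract highlights.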
Similar Papers
DailyLLM: Context-Aware Activity Log Generation Using Multi-Modal Sensors and LLMs
Artificial Intelligence
Makes phones understand your daily life better.
LLMs Meet Cross-Modal Time Series Analytics: Overview and Directions
Machine Learning (CS)
Helps computers understand time data like words.
Context-Aware Human Behavior Prediction Using Multimodal Large Language Models: Challenges and Insights
Robotics
Helps robots understand what people will do.