Context-Aware Human Behavior Prediction Using Multimodal Large Language Models: Challenges and Insights

Published: April 1, 2025 | arXiv ID: 2504.00839v2

By: Yuchen Liu, Lino Lerch, Luigi Palmieri and more

Potential Business Impact:

Helps robots understand what people will do.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Predicting human behavior in shared environments is crucial for safe and efficient human-robot interaction. Traditional data-driven methods to that end are pre-trained on domain-specific datasets, activity types, and prediction horizons. In contrast, the recent breakthroughs in Large Language Models (LLMs) promise open-ended cross-domain generalization to describe various human activities and make predictions in any context. In particular, Multimodal LLMs (MLLMs) are able to integrate information from various sources, achieving more contextual awareness and improved scene understanding. The difficulty in applying general-purpose MLLMs directly for prediction stems from their limited capacity for processing large input sequences, sensitivity to prompt design, and expensive fine-tuning. In this paper, we present a systematic analysis of applying pre-trained MLLMs for context-aware human behavior prediction. To this end, we introduce a modular multimodal human activity prediction framework that allows us to benchmark various MLLMs, input variations, In-Context Learning (ICL), and autoregressive techniques. Our evaluation indicates that the best-performing framework configuration is able to reach 92.8% semantic similarity and 66.1% exact label accuracy in predicting human behaviors in the target frame.
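The abstract reports two scores, exact label accuracy and semantic similarity, without saying how they are computed. As a rough illustration only, the sketch below computes an exact-match rate over predicted labels and a mean cosine similarity over simple bag-of-words vectors. The function names, the lowercase/whitespace normalization, and the bag-of-words stand-in for a learned sentence embedding are all assumptions made for this example, not the paper's evaluation code.

```python
import math
from collections import Counter


def exact_label_accuracy(predicted, reference):
    """Fraction of predictions that match the reference label exactly.

    Assumption: labels are compared after trimming and lowercasing.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    matches = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predicted, reference)
    )
    return matches / len(reference)


def bow_cosine(a, b):
    """Cosine similarity between bag-of-words count vectors of two labels.

    Assumption: a toy stand-in for an embedding-based similarity.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_semantic_similarity(predicted, reference):
    """Average pairwise similarity between predicted and reference labels."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    scores = [bow_cosine(p, r) for p, r in zip(predicted, reference)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical labels, invented for illustration.
    preds = ["pick up cup", "walk to door"]
    refs = ["pick up cup", "walk to table"]
    print(exact_label_accuracy(preds, refs))
    print(mean_semantic_similarity(preds, refs))
```

A near-miss such as "walk to door" versus "walk to table" counts as wrong under exact matching but still earns partial credit under a similarity score, which is one plausible reason the two reported figures differ so much.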

M-CALLM: Multi-level Context Aware LLM Framework for Group Interaction Prediction

Human-Computer Interaction

Helps computers guess what groups will do together.

18 Nov 2025 0

90%

Unlocking In-Context Learning for Natural Datasets Beyond Language Modelling

Computation and Language

Teaches computers to learn new things from examples.

9 Jan 2025 1

90%

Few-shot Vision-based Human Activity Recognition with MLLM-based Visual Reinforcement Learning

Robotics

Teaches computers to recognize actions from few pictures.

14 Aug 2025 0

Country of Origin

🇩🇪 Germany

Page Count

8 pages
