To See or To Read: User Behavior Reasoning in Multimodal LLMs
By: Tianning Dong, Luyi Ma, Varun Vasudevan, and more
Potential Business Impact:
Pictures help computers guess what you'll buy next.
Multimodal Large Language Models (MLLMs) are reshaping how modern agentic systems reason over sequential user-behavior data. However, whether textual or image representations of user-behavior data are more effective for maximizing MLLM performance remains underexplored. We present BehaviorLens, a systematic benchmarking framework for assessing modality trade-offs in user-behavior reasoning across six MLLMs by representing transaction data as (1) a text paragraph, (2) a scatter plot, and (3) a flowchart. Using a real-world purchase-sequence dataset, we find that when the data is represented as images, MLLMs' next-purchase prediction accuracy improves by 87.5% compared with an equivalent textual representation, at no additional computational cost.
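To make the modality comparison concrete, the sketch below renders one purchase sequence both as a text paragraph and as a scatter-plot image, the two kinds of inputs an MLLM would receive in such a benchmark. The purchase records and the `query_mllm` helper are hypothetical placeholders for illustration; this is not the paper's BehaviorLens implementation, only a minimal sketch under those assumptions.

```python
# Minimal sketch: the same purchase sequence as (1) a text paragraph and
# (2) a scatter-plot image, mirroring two of the three representations
# compared in the benchmark. Data and the MLLM call are illustrative.
from datetime import datetime
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Hypothetical purchase sequence: (timestamp, item, price)
purchases = [
    (datetime(2024, 1, 3), "running shoes", 89.99),
    (datetime(2024, 1, 17), "sports socks", 12.50),
    (datetime(2024, 2, 2), "water bottle", 18.00),
    (datetime(2024, 2, 20), "fitness tracker", 129.00),
]

# (1) Textual representation: one paragraph describing the sequence in order.
text_repr = " ".join(
    f"On {ts:%Y-%m-%d} the user bought {item} for ${price:.2f}."
    for ts, item, price in purchases
)

# (2) Image representation: a scatter plot of price over time, one point per purchase.
fig, ax = plt.subplots(figsize=(6, 3))
ax.scatter([ts for ts, _, _ in purchases], [p for _, _, p in purchases])
for ts, item, price in purchases:
    ax.annotate(item, (ts, price), textcoords="offset points", xytext=(4, 4))
ax.set_xlabel("purchase date")
ax.set_ylabel("price (USD)")
ax.set_title("User purchase sequence")
fig.autofmt_xdate()
fig.savefig("purchase_sequence.png", dpi=150, bbox_inches="tight")

prompt = "Given this user's purchase history, predict the next item they will buy."
# Either representation would then be sent to an MLLM, e.g.:
#   query_mllm(prompt, text=text_repr)                  # text-only condition
#   query_mllm(prompt, image="purchase_sequence.png")   # image condition
# `query_mllm` is a hypothetical wrapper around whichever MLLM API is used.
```

Keeping the prompt and underlying data identical across conditions, and varying only the representation, is what lets accuracy differences be attributed to modality rather than content.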
Similar Papers
Multimodal LLM Augmented Reasoning for Interpretable Visual Perception Analysis
Human-Computer Interaction
Helps computers understand pictures like people do.
GazeLLM: Multimodal LLMs incorporating Human Visual Attention
Human-Computer Interaction
Lets computers understand videos by watching eyes.
Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning
CV and Pattern Recognition
Helps AI "think" with pictures, not just look.