Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?
By: Zabir Al Nazi, G M Shahariar, Abrar Hossain, and more
Potential Business Impact:
Lets computers understand people from different cultures.
Theory of Mind (ToM) -- the ability to attribute beliefs, desires, and emotions to others -- is fundamental to human social intelligence, yet it remains a major challenge for artificial agents. Existing Vision-Language Models (VLMs) are increasingly applied to socially grounded tasks, but their capacity for cross-cultural ToM reasoning is largely unexplored. In this work, we introduce CulturalToM-VQA, a new evaluation benchmark containing 5,095 questions designed to probe ToM reasoning across diverse cultural contexts through visual question answering. The dataset captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks. Our dataset is built through a VLM-assisted human-in-the-loop pipeline: human experts first curate culturally rich images spanning traditions, rituals, and social interactions; a VLM then assists in generating structured, ToM-focused scene descriptions, which are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels. The resulting dataset covers diverse ToM facets, including mental state attribution, false belief reasoning, non-literal communication, social norm violations, perspective coordination, and multi-agent reasoning.
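To make the benchmark's record structure concrete, here is a minimal Python sketch of how a single entry might be represented. The class name `CulturalToMExample`, all field names, and the sample entry are illustrative assumptions; only the six task types and the four graded complexity levels come from the abstract itself.

```python
from dataclasses import dataclass
from enum import Enum

class ToMTask(Enum):
    """The six ToM task types named in the abstract."""
    MENTAL_STATE_ATTRIBUTION = "mental_state_attribution"
    FALSE_BELIEF_REASONING = "false_belief_reasoning"
    NON_LITERAL_COMMUNICATION = "non_literal_communication"
    SOCIAL_NORM_VIOLATION = "social_norm_violation"
    PERSPECTIVE_COORDINATION = "perspective_coordination"
    MULTI_AGENT_REASONING = "multi_agent_reasoning"

@dataclass
class CulturalToMExample:
    image_path: str         # culturally rich image curated by human experts
    scene_description: str  # structured, ToM-focused description drafted with VLM assistance
    question: str           # QA pair refined from the scene description
    answer: str             # gold answer for the VQA question
    task: ToMTask           # one of the six ToM task types
    complexity: int         # graded complexity level, 1 (easiest) to 4 (hardest)

# Hypothetical entry, for illustration only:
example = CulturalToMExample(
    image_path="images/tea_ceremony_0042.jpg",
    scene_description="A host bows while presenting a cup to an elder guest...",
    question="Why does the host avoid direct eye contact with the guest?",
    answer="To show deference; prolonged eye contact with elders can be seen as disrespectful.",
    task=ToMTask.MENTAL_STATE_ATTRIBUTION,
    complexity=2,
)
```

Keying each record to an explicit task type and complexity level, as the taxonomy described above suggests, would let evaluation scripts report per-facet and per-difficulty accuracy rather than a single aggregate score.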
Similar Papers
MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents
Artificial Intelligence
Robots understand what people think and do.
Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models?
Computation and Language
Computers can guess what others think, but perhaps without real understanding.
Theory of Mind in Large Language Models: Assessment and Enhancement
Computation and Language
Helps computers understand what people are thinking.