World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
By: Eunsu Kim, Junyeong Park, Na Min An, and more
Potential Business Impact:
Helps computers understand mixed cultures in pictures.
In a globalized world, cultural elements from diverse origins frequently appear together within a single visual scene. We refer to these as culture mixing scenarios, yet how Large Vision-Language Models (LVLMs) perceive them remains underexplored. We investigate culture mixing as a critical challenge for LVLMs and examine how current models behave when cultural items from multiple regions appear together. To systematically analyze these behaviors, we construct CultureMix, a food Visual Question Answering (VQA) benchmark with 23k diffusion-generated, human-verified culture mixing images across four subtasks: (1) food-only, (2) food+food, (3) food+background, and (4) food+food+background. Evaluating 10 LVLMs, we find consistent failures to preserve individual cultural identities in mixed settings. Models show strong background reliance, with accuracy dropping 14% when cultural backgrounds are added to food-only baselines, and they produce inconsistent predictions for identical foods across different contexts. To address these limitations, we explore three robustness strategies. We find that supervised fine-tuning on a diverse culture mixing dataset substantially improves model consistency and reduces background sensitivity. We call for increased attention to culture mixing scenarios as a critical step toward developing LVLMs capable of operating reliably in culturally diverse real-world environments.
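The two measurements the abstract describes, per-subtask accuracy and whether a model gives the same answer for an identical food across contexts, can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the record layout, the function names, and the toy predictions are all hypothetical, and the paper's exact metric definitions may differ.

```python
from collections import defaultdict

# Hypothetical toy records, invented for illustration only:
# (food_id, subtask, predicted_label, gold_label)
records = [
    ("f1", "food-only", "kimchi", "kimchi"),
    ("f1", "food+background", "sushi", "kimchi"),
    ("f2", "food-only", "taco", "taco"),
    ("f2", "food+background", "taco", "taco"),
]


def accuracy_by_subtask(rows):
    """Return the fraction of correct predictions for each subtask."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for _food_id, subtask, pred, gold in rows:
        total[subtask] += 1
        correct[subtask] += int(pred == gold)
    return {subtask: correct[subtask] / total[subtask] for subtask in total}


def prediction_consistency(rows):
    """Return the fraction of foods that get one single prediction across all contexts."""
    predictions = defaultdict(set)
    for food_id, _subtask, pred, _gold in rows:
        predictions[food_id].add(pred)
    consistent = sum(len(labels) == 1 for labels in predictions.values())
    return consistent / len(predictions)


acc = accuracy_by_subtask(records)
# Accuracy change when a cultural background is added to the food-only baseline.
background_drop = acc["food-only"] - acc["food+background"]
consistency = prediction_consistency(records)
```

On this toy data, the gap between the food-only and food+background accuracies plays the role of the background-reliance drop, and the consistency score counts how many foods keep a single label regardless of context.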