Vision Language Models are Confused Tourists
By: Patrick Amadeus Irawan, Ikhlasul Akmal Hanif, Muhammad Dehan Al Kautsar, and more
Potential Business Impact:
Tests whether AI stays accurate when images mix cues from different cultures.
Although the cultural dimension has become a key aspect of evaluating Vision-Language Models (VLMs), their ability to stay stable across diverse cultural inputs remains largely untested, despite being crucial for supporting diversity and multicultural societies. Existing evaluations often rely on benchmarks featuring only a single cultural concept per image, overlooking scenarios where multiple, potentially unrelated cultural cues coexist. To address this gap, we introduce ConfusedTourist, a novel cultural adversarial robustness suite designed to assess VLMs' stability against perturbed geographical cues. Our experiments reveal a critical vulnerability: accuracy drops sharply under simple image-stacking perturbations and degrades even further with their image-generation-based variant. Interpretability analyses further show that these failures stem from systematic attention shifts toward distracting cues, diverting the model from its intended focus. These findings highlight a critical challenge: visual cultural concept mixing can substantially impair even state-of-the-art VLMs, underscoring the urgent need for more culturally robust multimodal understanding.
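To make the image-stacking perturbation concrete, below is a minimal sketch of how such an input could be constructed: a culturally unrelated distractor image is placed next to the original before the VLM is queried. This assumes the Pillow library; the function name and file paths are illustrative and not taken from the paper's released code.

```python
# Minimal sketch of an image-stacking perturbation, assuming Pillow is installed.
# File names and the helper function are hypothetical examples.
from PIL import Image


def stack_images(original_path: str, distractor_path: str) -> Image.Image:
    """Concatenate the original image and a distractor side by side."""
    original = Image.open(original_path).convert("RGB")
    distractor = Image.open(distractor_path).convert("RGB")

    # Resize the distractor to the original's height, preserving aspect ratio.
    new_width = int(distractor.width * original.height / distractor.height)
    distractor = distractor.resize((new_width, original.height))

    # Paste both images onto a single canvas.
    canvas = Image.new("RGB", (original.width + distractor.width, original.height))
    canvas.paste(original, (0, 0))
    canvas.paste(distractor, (original.width, 0))
    return canvas


if __name__ == "__main__":
    # The perturbed image would then be passed to a VLM together with the same
    # geography/culture question that was asked about the clean image.
    perturbed = stack_images("temple_japan.jpg", "distractor_landmark.jpg")
    perturbed.save("stacked_example.jpg")
```

Comparing the model's answers on the clean and stacked versions of the same image is what exposes the accuracy drop described above.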
Similar Papers
World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
CV and Pattern Recognition
Helps computers understand mixed cultures in pictures.
Toward Socially Aware Vision-Language Models: Evaluating Cultural Competence Through Multimodal Story Generation
Computation and Language
AI stories change to match different cultures.
Examining Vision Language Models through Multi-dimensional Experiments with Vision and Text Features
CV and Pattern Recognition
Fixes AI mistakes when looking at pictures.