VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage
By: A. Alfarano, L. Venturoli, D. Negueruela del Castillo
Potential Business Impact:
Tests if computers truly understand art.
Multimodal Large Language Models (MLLMs) have demonstrated significant capabilities in joint visual and linguistic tasks. However, existing Visual Question Answering (VQA) benchmarks often fail to evaluate deep semantic understanding, particularly in complex domains like visual art analysis. Their questions, confined to simple syntactic structures and surface-level attributes, fail to capture the diversity and depth of human visual inquiry. This limitation incentivizes models to exploit statistical shortcuts rather than engage in visual reasoning. To address this gap, we introduce VQArt-Bench, a new, large-scale VQA benchmark for the cultural heritage domain. This benchmark is constructed using a novel multi-agent pipeline in which specialized agents collaborate to generate nuanced, validated, and linguistically diverse questions. The resulting benchmark is structured along relevant visual understanding dimensions that probe a model's ability to interpret symbolic meaning, narratives, and complex visual relationships. Our evaluation of 14 state-of-the-art MLLMs on this benchmark reveals significant limitations in current models, including a surprising weakness in simple counting tasks and a clear performance gap between proprietary and open-source models.
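The abstract does not detail the multi-agent pipeline, but a minimal sketch of a generate-rephrase-validate loop of the kind it describes might look like the following. All names here (LLMFn, generator_agent, validator_agent, rephraser_agent, the prompt wording, and the example dimensions) are illustrative assumptions, not the authors' actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumed text-in, text-out model interface; any LLM/MLLM client could be
# wrapped to match this signature (hypothetical, not from the paper).
LLMFn = Callable[[str], str]

@dataclass
class QAItem:
    image_id: str
    question: str
    answer: str
    dimension: str          # e.g. "symbolism", "narrative", "visual relations"
    validated: bool = False

def generator_agent(llm: LLMFn, image_id: str, caption: str, dimension: str) -> QAItem:
    """Drafts one question/answer pair about an artwork for a given dimension."""
    prompt = (
        f"Artwork description: {caption}\n"
        f"Write one question probing the '{dimension}' dimension, "
        "then its answer, separated by '||'."
    )
    question, _, answer = llm(prompt).partition("||")
    return QAItem(image_id, question.strip(), answer.strip(), dimension)

def rephraser_agent(llm: LLMFn, item: QAItem) -> QAItem:
    """Rewrites the question to increase linguistic diversity."""
    item.question = llm(f"Rephrase without changing the meaning: {item.question}").strip()
    return item

def validator_agent(llm: LLMFn, item: QAItem, caption: str) -> QAItem:
    """Keeps the item only if the answer is grounded in the source description."""
    verdict = llm(
        f"Description: {caption}\nQ: {item.question}\nA: {item.answer}\n"
        "Is the answer supported by the description? Reply YES or NO."
    )
    item.validated = verdict.strip().upper().startswith("YES")
    return item

def build_benchmark(llm: LLMFn, captions: dict, dimensions: List[str]) -> List[QAItem]:
    """Runs the agent chain over every (artwork, dimension) pair."""
    items: List[QAItem] = []
    for image_id, caption in captions.items():
        for dim in dimensions:
            item = generator_agent(llm, image_id, caption, dim)
            item = rephraser_agent(llm, item)
            item = validator_agent(llm, item, caption)
            if item.validated:
                items.append(item)
    return items
```

The design point the sketch illustrates is the separation of roles: one agent proposes, another diversifies the phrasing, and a third filters out items that are not grounded, which is how such a pipeline can yield validated and linguistically diverse questions at scale.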
Similar Papers
Evaluating Variance in Visual Question Answering Benchmarks
CV and Pattern Recognition
Makes AI answers more trustworthy and consistent.
EncQA: Benchmarking Vision-Language Models on Visual Encodings for Charts
CV and Pattern Recognition
Helps computers better understand charts and graphs.
VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding
CV and Pattern Recognition
Helps computers find answers in any language document.