VLM@school -- Evaluation of AI image understanding on German middle school knowledge
By: René Peinl, Vincent Tischler
Potential Business Impact:
Tests AI's smarts using school lessons.
This paper introduces a novel benchmark dataset designed to evaluate the capabilities of Vision Language Models (VLMs) on tasks that combine visual reasoning with subject-specific background knowledge in the German language. In contrast to widely used English-language benchmarks that often rely on artificially difficult or decontextualized problems, this dataset draws from real middle school curricula across nine domains including mathematics, history, biology, and religion. The benchmark includes over 2,000 open-ended questions grounded in 486 images, ensuring that models must integrate visual interpretation with factual reasoning rather than rely on superficial textual cues. We evaluate thirteen state-of-the-art open-weight VLMs across multiple dimensions, including domain-specific accuracy and performance on adversarial crafted questions. Our findings reveal that even the strongest models achieve less than 45% overall accuracy, with particularly poor performance in music, mathematics, and adversarial settings. Furthermore, the results indicate significant discrepancies between success on popular benchmarks and real-world multimodal understanding. We conclude that middle school-level tasks offer a meaningful and underutilized avenue for stress-testing VLMs, especially in non-English contexts. The dataset and evaluation protocol serve as a rigorous testbed to better understand and improve the visual and linguistic reasoning capabilities of future AI systems.
Similar Papers
Benchmarking Vision Language Models on German Factual Data
Computation and Language
Helps computers understand German pictures better.
Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms
Computation and Language
Helps grade student tests and give feedback.
Bias in the Picture: Benchmarking VLMs with Social-Cue News Images and LLM-as-Judge Assessment
CV and Pattern Recognition
Finds and fixes unfairness in AI that sees and reads.