
VLM@school -- Evaluation of AI image understanding on German middle school knowledge

Published: June 13, 2025 | arXiv ID: 2506.11604v2

By: René Peinl, Vincent Tischler

Potential Business Impact:

Tests AI's smarts using school lessons.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

This paper introduces a novel benchmark dataset designed to evaluate the capabilities of Vision Language Models (VLMs) on tasks that combine visual reasoning with subject-specific background knowledge in the German language. In contrast to widely used English-language benchmarks that often rely on artificially difficult or decontextualized problems, this dataset draws from real middle school curricula across nine domains, including mathematics, history, biology, and religion. The benchmark includes over 2,000 open-ended questions grounded in 486 images, ensuring that models must integrate visual interpretation with factual reasoning rather than rely on superficial textual cues. We evaluate thirteen state-of-the-art open-weight VLMs across multiple dimensions, including domain-specific accuracy and performance on adversarially crafted questions. Our findings reveal that even the strongest models achieve less than 45% overall accuracy, with particularly poor performance in music, mathematics, and adversarial settings. Furthermore, the results indicate significant discrepancies between success on popular benchmarks and real-world multimodal understanding. We conclude that middle school-level tasks offer a meaningful and underutilized avenue for stress-testing VLMs, especially in non-English contexts. The dataset and evaluation protocol serve as a rigorous testbed to better understand and improve the visual and linguistic reasoning capabilities of future AI systems.
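The abstract reports overall and domain-specific accuracy. As a rough illustration of how such figures can be aggregated, here is a minimal sketch that assumes each open-ended answer has already been judged correct or incorrect. The function name `accuracy_by_domain`, the record layout, and the toy data are hypothetical and are not taken from the paper or its evaluation protocol.

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Aggregate judged answers into per-domain and overall accuracy.

    `records` is an iterable of (domain, is_correct) pairs. This layout is
    an assumption made for illustration, not the paper's data format.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for domain, is_correct in records:
        totals[domain] += 1
        correct[domain] += int(bool(is_correct))

    n = sum(totals.values())
    if n == 0:
        # No judged answers: nothing to aggregate.
        return {}, 0.0

    per_domain = {d: correct[d] / totals[d] for d in totals}
    overall = sum(correct.values()) / n
    return per_domain, overall


# Made-up toy judgments, only to show the call pattern.
records = [
    ("mathematics", False),
    ("mathematics", True),
    ("history", True),
    ("music", False),
]
per_domain, overall = accuracy_by_domain(records)
```

Note that the overall figure here is a micro-average over all questions, so domains with more questions weigh more heavily. Whether the paper averages this way is not stated in the abstract.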

Benchmarking Vision Language Models on German Factual Data

Computation and Language

Helps computers understand German pictures better.

15 Apr 2025 1

91%

Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms

Computation and Language

Helps grade student tests and give feedback.

5 Jun 2025 1

91%

Bias in the Picture: Benchmarking VLMs with Social-Cue News Images and LLM-as-Judge Assessment

CV and Pattern Recognition

Finds and fixes unfairness in AI that sees and reads.

24 Sep 2025 1


Page Count

17 pages
