VLM@school -- Evaluation of AI image understanding on German middle school knowledge

📅 2025-06-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) lack rigorous evaluation in non-English, knowledge-intensive educational contexts. Method: We introduce GERMAN-SCHOOL-VLM, the first German multimodal benchmark aligned with lower-secondary school curricula and grounded in real-world classroom pedagogy. It covers nine disciplines (e.g., mathematics, history, biology) and comprises 486 authentic classroom images paired with over 2,000 open-ended questions that demand tight integration of domain knowledge and visual reasoning. Contribution/Results: We propose a multidimensional evaluation framework covering domain accuracy, adversarial robustness, and cross-task generalization, and systematically assess 13 open-weight VLMs. Even state-of-the-art models achieve only 44.7% overall accuracy, with pronounced deficits in mathematics and music, underscoring critical limitations in contextualized, interdisciplinary multimodal understanding. This work establishes a rigorous paradigm and a publicly available resource for evaluating AI capabilities in educational settings.

📝 Abstract
This paper introduces a novel benchmark dataset designed to evaluate the capabilities of Vision Language Models (VLMs) on tasks that combine visual reasoning with subject-specific background knowledge in the German language. In contrast to widely used English-language benchmarks that often rely on artificially difficult or decontextualized problems, this dataset draws from real middle school curricula across nine domains including mathematics, history, biology, and religion. The benchmark includes over 2,000 open-ended questions grounded in 486 images, ensuring that models must integrate visual interpretation with factual reasoning rather than rely on superficial textual cues. We evaluate thirteen state-of-the-art open-weight VLMs across multiple dimensions, including domain-specific accuracy and performance on adversarially crafted questions. Our findings reveal that even the strongest models achieve less than 45% overall accuracy, with particularly poor performance in music, mathematics, and adversarial settings. Furthermore, the results indicate significant discrepancies between success on popular benchmarks and real-world multimodal understanding. We conclude that middle school-level tasks offer a meaningful and underutilized avenue for stress-testing VLMs, especially in non-English contexts. The dataset and evaluation protocol serve as a rigorous testbed to better understand and improve the visual and linguistic reasoning capabilities of future AI systems.
Problem

Research questions and friction points this paper is trying to address.

Evaluate VLMs on German middle school knowledge tasks
Assess visual reasoning with subject-specific background knowledge
Test VLMs' real-world multimodal understanding in non-English contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark dataset built from real German middle school curricula
Over 2,000 open-ended questions grounded in 486 images
Evaluates 13 open-weight VLMs on domain-specific accuracy and adversarial robustness
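The evaluation described above reports per-domain and overall accuracy across nine subjects. A minimal sketch of that aggregation step is shown below; the function name and input format are assumptions for illustration, and the step that judges an open-ended answer against its reference is out of scope here.

```python
from collections import defaultdict

def score_by_domain(results):
    """Aggregate per-domain and overall accuracy.

    `results` is a list of (domain, correct) pairs, where `correct` is a
    boolean verdict for one open-ended question. How that verdict is
    obtained (e.g., comparing a model answer to a reference) is assumed
    to happen upstream.
    """
    totals = defaultdict(int)   # questions seen per domain
    hits = defaultdict(int)     # correct answers per domain
    for domain, correct in results:
        totals[domain] += 1
        hits[domain] += int(correct)
    per_domain = {d: hits[d] / totals[d] for d in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_domain, overall
```

For example, `score_by_domain([("mathematics", False), ("history", True)])` yields 0% accuracy for mathematics, 100% for history, and 50% overall.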