Benchmarking Vision Language Models on German Factual Data

📅 2025-04-15
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work identifies systematic deficiencies in open-weight vision-language models (VLMs) for factually grounded visual understanding in German—despite German being a high-resource language, current VLMs exhibit significantly weaker image content recognition and German text comprehension compared to English. To isolate the effects of visual content versus linguistic prompts, the study introduces a novel bilingual prompting framework and a geographically grounded image dataset covering German landmarks, celebrities, flora/fauna, automobiles, and household objects. It employs a jury-as-a-judge evaluation paradigm with fine-grained category-level comparisons (e.g., scientific names vs. German common names). Results reveal that models underperform markedly on German landmarks and celebrities relative to their international counterparts; recognize flora/fauna reliably only via English or Latin names but fail on German vernacular terms; yet demonstrate cross-lingual robustness for automobiles and household items. This work fills a critical gap in non-English VLM evaluation and provides both methodological rigor and empirical grounding for developing multilingual multimodal benchmarks.
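The jury-as-a-judge evaluation with category-level, per-language accuracy described above could be sketched roughly as follows. This is an illustrative assumption, not the paper's actual implementation: the function names, the majority-vote aggregation rule, and the toy records are all hypothetical.

```python
# Hypothetical sketch of jury-as-a-judge accuracy aggregation over
# bilingual prompts. All names and data here are illustrative, not
# taken from the paper's actual pipeline.
from collections import defaultdict

def jury_verdict(votes):
    """Majority vote over individual juror judgments (True = answer accepted)."""
    return sum(votes) > len(votes) / 2

def accuracy_by_category(records):
    """records: iterable of (category, prompt_lang, juror_votes) tuples.
    Returns {(category, prompt_lang): fraction of items the jury accepted}."""
    accepted = defaultdict(int)
    total = defaultdict(int)
    for category, lang, votes in records:
        key = (category, lang)
        total[key] += 1
        accepted[key] += jury_verdict(votes)
    return {k: accepted[k] / total[k] for k in total}

# Toy example: the same items judged under German and English prompts,
# so per-category gaps between prompt languages become visible.
records = [
    ("landmarks", "de", [True, False, False]),   # jury rejects
    ("landmarks", "en", [True, True, False]),    # jury accepts
    ("flora_fauna", "de", [False, False, True]), # jury rejects
    ("flora_fauna", "en", [True, True, True]),   # jury accepts
]
print(accuracy_by_category(records))
```

Comparing the resulting per-category accuracies across prompt languages (and across German vs. international image sets) is what lets the visual and linguistic failure modes be separated.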

📝 Abstract
Similar to LLMs, the development of vision language models is driven mainly by English datasets and by models trained on English and Chinese, whereas support for other languages, even high-resource languages such as German, remains significantly weaker. In this work we present an analysis of open-weight VLMs on factual knowledge in German and English. We disentangle the image-related aspects from the textual ones by analyzing accuracy with jury-as-a-judge evaluation in both prompt languages, using images from German and international contexts. We found that for celebrities and sights, VLMs struggle because they lack visual cognition of German image contents. For animals and plants, the tested models can often correctly identify the image contents by the scientific name or English common name but fail in German. Cars and supermarket products were identified equally well in English and German images across both prompt languages.
Problem

Research questions and friction points this paper is trying to address.

Assessing VLMs' performance on German factual data
Analyzing visual-textual accuracy gaps in German contexts
Identifying language-specific weaknesses in multilingual VLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking VLMs on German factual data
Disentangling image and text accuracy analysis
Evaluating VLMs' multilingual visual cognition