🤖 AI Summary
This work identifies systematic deficiencies in open-weight vision-language models (VLMs) for factually grounded visual understanding in German. Although German is a high-resource language, current VLMs show significantly weaker image content recognition and German text comprehension than they do for English. To isolate the effects of visual content from those of the prompt language, the study combines a bilingual prompting setup with an image dataset from German and international contexts covering landmarks, celebrities, flora and fauna, automobiles, and supermarket products. It employs a jury-as-a-judge evaluation paradigm with fine-grained category-level comparisons (e.g., scientific names vs. German common names). Results show that models underperform markedly on German landmarks and celebrities relative to their international counterparts; that they recognize flora and fauna reliably only via English or Latin names and fail on German vernacular terms; and that they are robust across languages for automobiles and supermarket products. The work addresses a gap in non-English VLM evaluation and provides methodological and empirical grounding for developing multilingual multimodal benchmarks.
📝 Abstract
As with LLMs, the development of vision-language models (VLMs) is driven mainly by English datasets and by models trained on English and Chinese, whereas support for other languages, even high-resource languages such as German, remains significantly weaker. In this work, we present an analysis of open-weight VLMs on factual knowledge in German and English. We disentangle the image-related aspects from the textual ones by analyzing accuracy with jury-as-a-judge in both prompt languages and on images from German and international contexts. We found that for celebrities and sights, VLMs struggle because they lack visual cognition of German image content. For animals and plants, the tested models can often correctly identify the image content by its scientific name or English common name but fail in German. Cars and supermarket products were identified equally well in English and German images across both prompt languages.
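The evaluation design described above (jury-as-a-judge verdicts, broken down by category, image context, and prompt language) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the majority-vote rule, the record fields (`category`, `image_origin`, `prompt_lang`, `votes`), and the toy data are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def jury_verdict(votes):
    """Aggregate boolean judge votes by strict majority (assumed rule).

    A tie counts as incorrect.
    """
    counts = Counter(votes)
    return counts[True] > counts[False]


def accuracy_by_condition(records):
    """Return accuracy per (category, image_origin, prompt_lang) cell."""
    totals = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for rec in records:
        key = (rec["category"], rec["image_origin"], rec["prompt_lang"])
        totals[key][0] += jury_verdict(rec["votes"])
        totals[key][1] += 1
    return {key: correct / total for key, (correct, total) in totals.items()}


# Invented toy records, purely illustrative (not results from the paper).
records = [
    {"category": "sights", "image_origin": "german", "prompt_lang": "de",
     "votes": [True, False, False]},
    {"category": "sights", "image_origin": "german", "prompt_lang": "de",
     "votes": [True, True, False]},
    {"category": "sights", "image_origin": "international", "prompt_lang": "de",
     "votes": [True, True, True]},
]

for condition, acc in sorted(accuracy_by_condition(records).items()):
    print(condition, acc)
```

Holding the prompt language fixed and comparing cells that differ only in image context isolates the visual side; holding the image context fixed and varying the prompt language isolates the textual side.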