🤖 AI Summary
Existing LLM evaluation benchmarks for multilingual European contexts lack a systematic taxonomy, sufficient cultural adaptation, and coordinated governance. Method: This paper introduces the first taxonomy specifically designed for European linguistic diversity, developed through a systematic literature review and analysis. It establishes a structured classification framework integrating linguistic typology and cultural dimensions, alongside quality norms, technical recommendations, and best practices for fair cross-lingual assessment. Contribution/Results: The work (1) incorporates language typology and regional cultural factors into the benchmark classification logic, and (2) proposes an extensible benchmark metadata model with interoperability guidelines. The resulting *European Language LLM Evaluation Practice Guide* has been formally adopted by the European Commission’s AI Office, substantially improving the systematicity, cultural appropriateness, and cross-benchmark comparability of non-English LLM evaluations.
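To make the idea of an extensible benchmark metadata model concrete, here is a minimal sketch of what such a record might look like. The paper's actual schema is not reproduced here; every field name below (`languages`, `task_type`, `typology_notes`, `cultural_context`) is an illustrative assumption, not the proposed standard.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of a benchmark metadata record.
# Field names are illustrative assumptions, not the paper's schema.
@dataclass
class BenchmarkMetadata:
    name: str
    languages: List[str]            # ISO 639-1 codes covered by the benchmark
    task_type: str                  # e.g. "QA", "NLI", "summarization"
    typology_notes: str = ""        # linguistic-typology features relevant to the task
    cultural_context: str = ""      # regional/cultural adaptation notes
    license: str = "unknown"        # licensing info supports interoperability checks

    def is_multilingual(self) -> bool:
        """A benchmark covering more than one language counts as multilingual."""
        return len(self.languages) > 1


example = BenchmarkMetadata(
    name="ExampleEuroBench",
    languages=["de", "fr", "pl"],
    task_type="QA",
    cultural_context="questions localized, not merely translated",
)
print(example.is_multilingual())  # True
```

A structured record like this is what makes cross-benchmark comparability tractable: tools can filter benchmarks by language coverage or task type without parsing free-text documentation.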
📝 Abstract
While new benchmarks for large language models (LLMs) are being developed continuously to keep pace with the growing capabilities of new models and AI in general, using and evaluating LLMs in non-English languages remains a little-charted landscape. We give a concise overview of recent developments in LLM benchmarking and then propose a new taxonomy for categorizing benchmarks that is tailored to multilingual or non-English use scenarios. We further propose a set of best practices and quality standards that could lead to more coordinated development of benchmarks for European languages. Among other recommendations, we advocate for greater sensitivity of evaluation methods to language and culture.