🤖 AI Summary
Existing LLM evaluation benchmarks exhibit a severe English bias, and multilingual benchmarks, particularly those for European languages, suffer from four critical deficiencies: weak cultural adaptability, low translation fidelity, incomplete coverage of reasoning capabilities, and uncalibrated biases. Method: We conduct the first systematic evaluation of seven mainstream multilingual European benchmarks, identify their shared limitations, and propose a novel “human-AI collaborative verification + iterative translation ranking” paradigm that integrates culture-aware design with rigorous human validation. We further develop a reproducible, multidimensional analytical framework encompassing translation quality assessment, cultural bias diagnosis, and fine-grained evaluation of reasoning capabilities. Contribution/Results: Our work advances the establishment of fair, verifiable, and culturally sensitive evaluation standards for LLMs in European languages, providing both methodological foundations and practical guidelines for multilingual LLM assessment.
📝 Abstract
The breakthrough of generative large language models (LLMs) that can solve different tasks through chat interaction has led to a significant increase in the use of general benchmarks to assess the quality or performance of these models beyond individual applications. The ever-increasing number of newly published models also calls for better methods to evaluate and compare them. However, most established benchmarks revolve around the English language. This paper analyses the benefits and limitations of current evaluation datasets, focusing on multilingual European benchmarks. We analyse seven multilingual benchmarks and identify four major challenges. Furthermore, we discuss potential solutions to enhance translation quality and mitigate cultural biases, including human-in-the-loop verification and iterative translation ranking. Our analysis highlights the need for culturally aware and rigorously validated benchmarks to accurately assess the reasoning and question-answering capabilities of multilingual LLMs.
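To make the proposed “iterative translation ranking + human-in-the-loop verification” paradigm concrete, the sketch below shows one plausible shape of such a loop: generate several candidate translations of a benchmark item, rank them with an automatic quality estimate, and escalate low-confidence winners to a human reviewer, retranslating until a candidate passes. This is a minimal illustrative sketch, not the paper's implementation; the helpers `translate`, `score_translation`, and `human_review`, along with all thresholds, are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    score: float  # automatic quality estimate in [0, 1]


def translate(item: str, n: int) -> list[str]:
    """Hypothetical: produce n candidate translations of one benchmark item."""
    return [f"{item} (candidate {i})" for i in range(n)]


def score_translation(source: str, candidate: str) -> float:
    """Hypothetical: automatic quality estimate (e.g. a COMET-style metric)."""
    return 0.5  # placeholder stub


def human_review(source: str, candidate: str) -> bool:
    """Hypothetical: a human verifier accepts or rejects the translation."""
    return True  # placeholder stub


def iterative_translation_ranking(item: str, n_candidates: int = 4,
                                  threshold: float = 0.8,
                                  max_rounds: int = 3) -> str | None:
    """Rank machine translations of a benchmark item; accept high-confidence
    winners directly, route borderline ones to human-in-the-loop verification,
    and retranslate until a candidate passes or the round budget runs out."""
    for _ in range(max_rounds):
        ranked = sorted(
            (Candidate(c, score_translation(item, c))
             for c in translate(item, n_candidates)),
            key=lambda c: c.score, reverse=True,
        )
        best = ranked[0]
        if best.score >= threshold or human_review(item, best.text):
            return best.text
    return None  # flag the item for manual translation from scratch


if __name__ == "__main__":
    print(iterative_translation_ranking("What is the capital of Austria?"))
```

In a real pipeline the scorer would be a learned quality-estimation metric and the reviewer a native speaker; keeping both behind simple function boundaries is what makes the ranking loop reproducible and auditable, in line with the verification goals described above.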