MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of benchmarks for evaluating multilingual large language models' (LLMs') grasp of locale-specific knowledge. The authors introduce MultiLoKo, a multilingual local-knowledge benchmark covering 31 languages. Its main partition contains 500 locally sourced, locally relevant questions per language, complemented by human-authored translations between English and the 30 non-English languages, plus corresponding machine translations, enabling controlled comparisons of local vs translated data and of human vs machine translation. Experiments on 11 multilingual LLMs show low average scores with large cross-lingual variance; the question language substantially affects performance, indicating sub-optimal knowledge transfer between languages; using local vs English-translated data can shift scores by more than 20 points for the best-performing models and change the estimated difficulty of some languages; and substituting machine for human translations alters model rankings and substantially lowers estimated performance for all models. MultiLoKo thus provides a rigorous tool for attributing cross-lingual knowledge capabilities.

📝 Abstract
We present MultiLoKo, a new benchmark for evaluating multilinguality in LLMs covering 31 languages. MultiLoKo consists of three partitions: a main partition of 500 questions per language, separately sourced to be locally relevant to the specific language, and two translated partitions, containing human-authored translations from 30 non-English languages to English and vice versa. For comparison, we also release corresponding machine-authored translations. The data is equally distributed over two splits: a dev split and a blind, out-of-distribution test split. MultiLoKo can be used to study a variety of questions regarding the multilinguality of LLMs as well as meta-questions about multilingual benchmark creation. We compute MultiLoKo scores for 11 base and chat models marketed as multilingual and study their average performance, their performance parity across languages, how much their ability to answer questions depends on the question language, and which languages are most difficult. None of the models we studied performs well on MultiLoKo, as indicated by low average scores as well as large differences between the best- and worst-scoring languages. Furthermore, we find a substantial effect of the question language, indicating sub-optimal knowledge transfer between languages. Lastly, we find that using local vs English-translated data can result in differences of more than 20 points for the best-performing models and can drastically change the estimated difficulty of some languages. When machine translations are used instead of human ones, we find a weaker effect on the ordering of language difficulty, a larger difference in model rankings, and a substantial drop in estimated performance for all models.
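The abstract summarizes each model by its average score and by the gap between its best- and worst-scoring languages (a performance-parity proxy). A minimal sketch of those two aggregates, using made-up illustrative scores rather than results from the paper:

```python
# Hypothetical per-language scores for one model (illustrative numbers only,
# NOT results reported in the MultiLoKo paper).
per_language_scores = {
    "en": 61.0, "fr": 48.5, "de": 47.0, "sw": 22.5, "te": 19.0,
}

def multiloko_summary(scores: dict) -> dict:
    """Average score across languages and the best-worst language gap."""
    values = list(scores.values())
    return {
        "average": sum(values) / len(values),  # overall score
        "gap": max(values) - min(values),      # cross-lingual parity proxy
    }

summary = multiloko_summary(per_language_scores)
print(summary)  # {'average': 39.6, 'gap': 42.0}
```

A large `gap` relative to `average` indicates uneven knowledge across languages, which is one of the failure modes the benchmark is designed to surface.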
Problem

Research questions and friction points this paper is trying to address.

Evaluating multilingual performance of LLMs across 31 languages
Assessing knowledge transfer between languages in LLMs
Comparing human vs machine translations in multilingual benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

A multilingual local-knowledge benchmark covering 31 languages, with locally sourced questions per language
Parallel human- and machine-authored translations, enabling direct comparison of translation quality effects
Controlled analysis of the impact of local vs English-translated data on scores and language-difficulty estimates
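The benchmark structure described in the abstract can be summarized in a small sketch. This is an illustrative layout under my own naming, not the official release format of the dataset:

```python
# Illustrative sketch of MultiLoKo's composition as described in the abstract
# (field names are my own, not the official data schema).
LANGUAGES = 31
QUESTIONS_PER_LANGUAGE = 500  # main partition, locally sourced per language

benchmark = {
    "partitions": {
        "main": "500 locally relevant questions per language",
        "human_translations": "30 non-English languages -> English and vice versa",
        "machine_translations": "machine-authored counterparts for comparison",
    },
    "splits": ["dev", "test (blind, out-of-distribution)"],
}

total_main_questions = LANGUAGES * QUESTIONS_PER_LANGUAGE
print(total_main_questions)  # 15500
```

The parallel human and machine translation partitions are what make the paper's meta-analyses possible: the same items can be scored in local, human-translated, and machine-translated form.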