Multilingual Large Language Models do not comprehend all natural languages to equal degrees

📅 2026-02-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the lack of systematic evaluation of multilingual large language models (LLMs) on low-resource and non-WEIRD languages, where performance is often implicitly assumed to peak in English. Evaluating three prominent models across twelve languages spanning five major language families, the authors employ consistent prompting templates, cross-linguistic sampling, and human baselines. Contrary to the prevailing "English-as-optimal" assumption, their analysis reveals that several Romance languages, including low-resource ones, outperform English on comprehension tasks. Moreover, all models fall significantly short of human performance, with the observed disparities strongly shaped by training data composition, tokenization strategies, and linguistic distance. These findings underscore the critical role of language-specific characteristics in shaping model capabilities and challenge dominant assumptions about cross-lingual performance hierarchies.

📝 Abstract
Large Language Models (LLMs) play a critical role in how humans access information. While their core use relies on comprehending written requests, our understanding of this ability is currently limited, because most benchmarks evaluate LLMs in high-resource languages predominantly spoken by Western, Educated, Industrialised, Rich, and Democratic (WEIRD) communities. The default assumption is that English is the best-performing language for LLMs, while smaller, low-resource languages are linked to less reliable outputs, even in multilingual, state-of-the-art models. To track variation in the comprehension abilities of LLMs, we prompt 3 popular models on a language comprehension task across 12 languages, representing the Indo-European, Afro-Asiatic, Turkic, Sino-Tibetan, and Japonic language families. Our results suggest that the models exhibit remarkable linguistic accuracy across typologically diverse languages, yet they fall behind human baselines in all of them, albeit to different degrees. Contrary to what was expected, English is not the best-performing language, as it was systematically outperformed by several Romance languages, even lower-resource ones. We frame the results by discussing the role of several factors that drive LLM performance, such as tokenization, language distance from Spanish and English, size of training data, and data origin in high- vs. low-resource languages and WEIRD vs. non-WEIRD communities.
Problem

Research questions and friction points this paper is trying to address.

Multilingual Large Language Models
language comprehension
low-resource languages
WEIRD bias
cross-lingual performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual LLMs
language comprehension
low-resource languages
cross-lingual evaluation
WEIRD bias
Natalia Moskvina
Universitat Autònoma de Barcelona
Raquel Montero
Universitat Autònoma de Barcelona
Masaya Yoshida
Universitat Autònoma de Barcelona; Institució Catalana de Recerca i Estudis Avançats (ICREA)
Ferdy Hubers
Assistant professor, CLS, Radboud University Nijmegen
psycholinguistics, second language acquisition, figurative language, syntax, language variation
Paolo Morosi
Universitat Autònoma de Barcelona
Walid Irhaymi
Universitat Autònoma de Barcelona
Jin Yan
Universitat Autònoma de Barcelona
Tamara Serrano
Universitat Autònoma de Barcelona
Elena Pagliarini
Universitat Autònoma de Barcelona
Fritz Günther
Department of Psychology, Humboldt-Universität zu Berlin
semantic memory, language models, conceptual combination, form-meaning mapping, vision models
Evelina Leivada
Research Professor at ICREA & Universitat Autònoma de Barcelona
Bilingualism, Language Variation, Language Acquisition, Morphosyntax