Quantifying Language Disparities in Multilingual Large Language Models

📅 2025-08-23
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Current multilingual large language model (LLM) evaluations suffer from confounding effects among language distribution, experimental setup, and model architecture, yielding scattered, incomparable results, especially for low-resource languages. To address this, we propose a decoupled evaluation framework built on three interpretable metrics: (i) the performance realisation ratio (actual performance relative to a theoretical upper bound), (ii) its coefficient of variation (quantifying cross-lingual stability), and (iii) language potential (estimating the inherent learnability of a language). This enables, for the first time, fine-grained attribution of model–language interaction effects. In a case study of 13 model variants on 11 multilingual datasets, the framework yields more reliable assessments, particularly for low-resource languages, and uncovers a pervasive latent deficiency in mainstream models: high aggregate performance coupled with low cross-lingual fairness. The work establishes a novel paradigm for equitable, interpretable evaluation of multilingual AI systems.
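The paper's exact formulas are not reproduced in this summary, so the sketch below is only an illustration of how the three metrics could be computed: the per-language scores, the per-language upper bounds, and the use of the upper bound as a proxy for language potential are all hypothetical assumptions, not the authors' definitions.

```python
import statistics

# Hypothetical per-language benchmark scores for one model, and assumed
# per-language theoretical upper bounds (performance ceilings).
scores = {"en": 0.82, "de": 0.78, "sw": 0.51, "yo": 0.44}
upper_bounds = {"en": 0.90, "de": 0.88, "sw": 0.74, "yo": 0.70}

# Performance realisation ratio: observed score relative to the
# language's assumed upper bound.
realisation = {lang: scores[lang] / upper_bounds[lang] for lang in scores}

# Coefficient of variation of the realisation ratios across languages:
# lower values suggest more even (fairer) cross-lingual behaviour.
values = list(realisation.values())
cv = statistics.stdev(values) / statistics.mean(values)

# Language potential (proxy only): here taken as the assumed upper bound,
# i.e. how learnable a language appears independent of any single model.
potential = upper_bounds

print({lang: round(r, 3) for lang, r in realisation.items()})
print(f"coefficient of variation: {cv:.3f}")
```

Under this sketch, a model can score well on average (high realisation ratios for high-resource languages) while still showing a large coefficient of variation, which mirrors the paper's finding that high aggregate performance does not imply cross-lingual fairness.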

Technology Category

Application Category

📝 Abstract
Results reported in large-scale multilingual evaluations are often fragmented and confounded by factors such as target languages, differences in experimental setups, and model choices. We propose a framework that disentangles these confounding variables and introduces three interpretable metrics--the performance realisation ratio, its coefficient of variation, and language potential--enabling a finer-grained and more insightful quantification of actual performance disparities across both (i) models and (ii) languages. Through a case study of 13 model variants on 11 multilingual datasets, we demonstrate that our framework provides a more reliable measurement of model performance and language disparities, particularly for low-resource languages, which have so far proven challenging to evaluate. Importantly, our results reveal that higher overall model performance does not necessarily imply greater fairness across languages.
Problem

Research questions and friction points this paper is trying to address.

Measuring performance disparities across multilingual models and languages
Disentangling confounding variables in multilingual evaluation
Assessing fairness for low-resource languages in LLM evaluations
Innovation

Methods, ideas, or system contributions that make the work stand out.

A framework that disentangles confounding variables in multilingual evaluation
Introduces three interpretable metrics for quantifying performance disparities
Provides more reliable measurement, particularly for low-resource languages