🤖 AI Summary
Current multilingual large language model (LLM) evaluations suffer from confounding effects among language distribution, experimental setup, and model architecture, which yield especially scattered, incomparable results for low-resource languages. To address this, we propose a decoupled evaluation framework built on three interpretable metrics: (i) the performance realisation ratio, measuring actual performance relative to a theoretical upper bound; (ii) the coefficient of variation of this ratio, quantifying cross-lingual stability; and (iii) language potential, estimating the inherent learnability of a language. Together, these metrics enable fine-grained attribution of model–language interaction effects. Evaluated across 13 model variants on 11 standardized benchmarks, the framework yields markedly more reliable assessments for low-resource languages and uncovers a latent deficiency in mainstream models: high aggregate performance coupled with low cross-lingual fairness. The work establishes a new paradigm for equitable, interpretable evaluation of multilingual AI systems.
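To make the three metrics concrete, here is a minimal sketch of how they could be computed from a model-by-language score matrix. The scores, the model and language names, and the upper-bound estimator (best observed score per language, standing in for language potential) are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

# Hypothetical scores: rows = models, columns = languages (values in [0, 1]).
scores = np.array([
    [0.92, 0.90, 0.42, 0.36],  # "model A": strong on high-resource languages
    [0.72, 0.70, 0.60, 0.54],  # "model B": more uniform across languages
])
languages = ["en", "de", "sw", "yo"]

# Assumed proxy for language potential (per-language theoretical upper bound):
# the best score any evaluated model reaches on that language. The paper's
# actual estimator may differ.
language_potential = scores.max(axis=0)

# Performance realisation ratio: observed score relative to the upper bound.
realisation = scores / language_potential

# Coefficient of variation of the realisation ratio across languages:
# lower values mean more uniform (fairer) cross-lingual behaviour.
cv = realisation.std(axis=1) / realisation.mean(axis=1)

for name, s, r, c in zip(["model A", "model B"], scores, realisation, cv):
    print(f"{name}: mean score={s.mean():.3f}  "
          f"mean realisation={r.mean():.3f}  CV={c:.3f}")
```

On this toy data, model A attains the higher mean score (0.650 vs. 0.640) but also the larger coefficient of variation (about 0.19 vs. 0.12), mirroring the finding that aggregate performance and cross-lingual fairness can diverge.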
📝 Abstract
Results reported in large-scale multilingual evaluations are often fragmented and confounded by factors such as target languages, differences in experimental setups, and model choices. We propose a framework that disentangles these confounding variables and introduces three interpretable metrics (the performance realisation ratio, its coefficient of variation, and language potential), enabling a finer-grained and more insightful quantification of actual performance disparities across both (i) models and (ii) languages. Through a case study of 13 model variants on 11 multilingual datasets, we demonstrate that our framework provides a more reliable measurement of model performance and language disparities, particularly for low-resource languages, which have so far proven challenging to evaluate. Importantly, our results reveal that higher overall model performance does not necessarily imply greater fairness across languages.
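As a rough sketch of the disentangling idea, the example below compares two hypothetical models reported on disjoint language subsets, a common source of fragmented results. Raw mean scores are incomparable across such setups, whereas realisation ratios normalise out per-language difficulty; all potential values and scores here are invented for illustration:

```python
import numpy as np

# Assumed per-language potentials (upper bounds); values are invented.
potential = {"en": 0.92, "de": 0.90, "sw": 0.60, "yo": 0.54}

# Two hypothetical models reported on disjoint language subsets, as often
# happens in fragmented multilingual evaluations.
model_x = {"en": 0.88, "de": 0.85}  # evaluated on high-resource languages only
model_y = {"sw": 0.57, "yo": 0.50}  # evaluated on low-resource languages only

def mean_realisation(model_scores):
    """Mean performance realisation ratio over the languages a model reports."""
    return float(np.mean([s / potential[lang] for lang, s in model_scores.items()]))

# Raw means suggest a large gap; realisation ratios show near parity.
print(f"raw mean:         x={np.mean(list(model_x.values())):.3f}  "
      f"y={np.mean(list(model_y.values())):.3f}")
print(f"mean realisation: x={mean_realisation(model_x):.3f}  "
      f"y={mean_realisation(model_y):.3f}")
```

Here the raw means (0.865 vs. 0.535) suggest a large gap, while the mean realisation ratios (about 0.95 vs. 0.94) indicate that both models extract a similar fraction of each language's potential.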