Test Set Quality in Multilingual LLM Evaluation

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies pervasive annotation errors in benchmark datasets used to evaluate multilingual large language models, focusing on French and Telugu test sets. To quantify their impact, the authors perform careful human annotation and comparative analysis, systematically assessing several state-of-the-art models on both the original and corrected versions of the datasets. The results show that annotation errors shift measured accuracy by up to 9.8%, enough to substantially distort assessments of model capability. Crucially, the study challenges the prevailing assumption that evaluation datasets are immutable and argues for versioned dataset management and continuous quality auditing. It also offers practical recommendations for dataset construction, validation, and iterative refinement, strengthening the reliability and rigor of multilingual model evaluation.

📝 Abstract
Several multilingual benchmark datasets have been developed in a semi-automatic manner in the recent past to measure progress and understand the state-of-the-art in the multilingual capabilities of Large Language Models. However, there is not a lot of attention paid to the quality of the datasets themselves, despite the existence of previous work identifying errors even in fully human-annotated test sets. In this paper, we manually analyze recent multilingual evaluation sets in two languages - French and Telugu, identifying several errors in the process. We compare the performance difference across several LLMs with the original and revised versions of the datasets and identify large differences (almost 10% in some cases) in both languages. Based on these results, we argue that test sets should not be considered immutable and should be revisited, checked for correctness, and potentially versioned. We end with some recommendations for both the dataset creators as well as consumers on addressing the dataset quality issues.
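
The abstract describes measuring the accuracy gap a model shows between the original and revised versions of a test set. As a rough illustration only (not the authors' code), the Python sketch below shows one way such a gap could be computed; the file paths, JSON field names, and the predict() stub are hypothetical placeholders.

```python
# Minimal sketch: compare a model's accuracy on the original vs. a revised
# version of a test set. File paths, field names, and predict() are
# hypothetical placeholders, not artifacts from the paper.
import json

def load_jsonl(path):
    """Read a JSONL test set into a list of dicts with 'question' and 'answer' keys."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def accuracy(examples, predict_fn):
    """Fraction of examples where the model's answer matches the gold answer."""
    correct = sum(1 for ex in examples if predict_fn(ex["question"]) == ex["answer"])
    return correct / len(examples) if examples else 0.0

def predict(question):
    """Placeholder: swap in a real LLM call (e.g., an inference API request)."""
    return ""

if __name__ == "__main__":
    original = load_jsonl("telugu_test_original.jsonl")  # hypothetical file names
    revised = load_jsonl("telugu_test_revised.jsonl")

    acc_orig = accuracy(original, predict)
    acc_rev = accuracy(revised, predict)
    # A gap on the order of 10% is what the paper reports for some model/dataset pairs.
    print(f"original: {acc_orig:.1%}  revised: {acc_rev:.1%}  delta: {acc_rev - acc_orig:+.1%}")
```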
Problem

Research questions and friction points this paper is trying to address.

Evaluating multilingual LLM test set quality
Identifying errors in multilingual benchmark datasets
Recommending improvements for dataset quality control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Manual analysis of multilingual evaluation sets
Comparison of LLM performance across dataset versions
Recommendations for dataset quality improvement