How Reliable is Multilingual LLM-as-a-Judge?

📅 2025-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study systematically evaluates the reliability of LLM-as-a-Judge in multilingual settings and finds low cross-lingual judgment consistency (average Fleiss' Kappa ≈ 0.3, with some models scoring lower). Method: We evaluate five models from different model families on five diverse tasks covering 25 languages; construct a multilingual benchmark; analyze the factors that influence cross-lingual consistency; and propose an ensemble-based judgment strategy. Contribution/Results: Consistency varies significantly across languages and is particularly poor for low-resource languages. Contrary to common assumptions, neither training on multilingual data nor increasing model scale directly improves consistency. The proposed ensemble strategy improves the consistency of the multilingual judge in real-world applications. Overall, the findings suggest that LLMs are not yet reliable judges of multilingual predictions.

📝 Abstract

LLM-as-a-Judge has emerged as a popular evaluation strategy, where advanced large language models assess generation results in alignment with human instructions. While these models serve as a promising alternative to human annotators, their reliability in multilingual evaluation remains uncertain. To bridge this gap, we conduct a comprehensive analysis of multilingual LLM-as-a-Judge. Specifically, we evaluate five models from different model families across five diverse tasks involving 25 languages. Our findings reveal that LLMs struggle to achieve consistent judgment results across languages, with an average Fleiss' Kappa of approximately 0.3, and some models performing even worse. To investigate the cause of inconsistency, we analyze various influencing factors. We observe that consistency varies significantly across languages, with particularly poor performance in low-resource languages. Additionally, we find that neither training on multilingual data nor increasing model scale directly improves judgment consistency. These findings suggest that LLMs are not yet reliable for evaluating multilingual predictions. We finally propose an ensemble strategy which improves the consistency of the multilingual judge in real-world applications.
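
For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of Fleiss' Kappa in plain Python. It assumes one plausible reading of the setup: each test instance is judged once per language, so the languages play the role of "raters" and the judge's verdict labels are the categories. The function name `fleiss_kappa` and the toy labels are illustrative and do not come from the paper.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' Kappa for a list of items, each rated by the same number of raters.

    ratings: list of lists; ratings[i] holds one label per rater for item i.
             (Assumption: here a "rater" is the judge run in one language.)
    categories: iterable of all possible labels.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(r) for r in ratings]

    # Observed agreement per item: fraction of rater pairs that agree.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Expected agreement by chance, from overall category proportions.
    total = n_items * n_raters
    p_e = sum((sum(c[cat] for c in counts) / total) ** 2 for cat in categories)

    if p_e == 1.0:
        # All ratings fall in one category; kappa is undefined.
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


# Toy example: 4 instances, each judged in 3 languages, verdict "A" or "B".
toy = [
    ["A", "A", "A"],
    ["A", "A", "B"],
    ["B", "B", "B"],
    ["A", "B", "B"],
]
kappa = fleiss_kappa(toy, categories=["A", "B"])
```

In this toy data the judge agrees with itself across languages on only half of the instances, and the resulting value is 1/3, which is in the same range as the average of roughly 0.3 reported in the abstract. A value of 1 would mean the verdict never changes with the language.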

Problem

Research questions and friction points this paper is trying to address.

Assessing reliability of multilingual LLM-as-a-Judge evaluation

Analyzing inconsistency in LLM judgments across 25 languages

Investigating factors affecting multilingual evaluation performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluate five models across 25 languages

Analyze factors affecting judgment inconsistency

Propose ensemble strategy to improve consistency
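
The summary does not say how the ensemble is formed. As an illustration only, the sketch below assumes the simplest possible variant: several judges each return a verdict label for the same instance, and the ensemble keeps the label chosen by a strict plurality. The function `majority_vote` and the tie handling are hypothetical choices, not the paper's method.

```python
from collections import Counter


def majority_vote(verdicts):
    """Combine verdict labels from several judges into one label.

    Returns the most frequent label, or None when the top count is tied.
    (Assumption: plain plurality voting; the paper's ensemble may differ.)
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    ranked = Counter(verdicts).most_common()
    best_count = ranked[0][1]
    winners = [label for label, n in ranked if n == best_count]
    return winners[0] if len(winners) == 1 else None


# Three hypothetical judges rate the same response.
combined = majority_vote(["A", "A", "B"])   # "A"
tied = majority_vote(["A", "B"])            # None, no strict winner
```

Returning `None` on a tie keeps the ambiguity visible to the caller instead of silently picking a label, which matters when the goal is to measure consistency.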

🔎 Similar Papers

LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks