When Languages Disagree: Self-Evolving Multilingual LLM Judges

📅 2026-06-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes SEMJ, a framework that reinterprets the inconsistent judgments that multilingual large language models often produce in cross-lingual evaluation not as noise but as complementary signals. SEMJ generates multilingual variants of each input, collects independent judgments and reasoning traces, and, when the judgments disagree, triggers self-reflection and re-evaluation to refine them iteratively. Departing from conventional voting or aggregation paradigms, the framework establishes a self-evolving multilingual evaluation mechanism. Experiments on multiple benchmarks show that SEMJ consistently outperforms voting and reflection baselines in both judgment accuracy and cross-lingual consistency.
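
The loop described in the summary above (build language variants, judge each independently, reflect when verdicts disagree) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `translate` and `judge` callables, their signatures, the `max_rounds` cap, and the majority-vote fallback are all assumptions introduced here, and the toy functions at the bottom stand in for real model calls.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical signatures (assumptions, not from the paper):
#   translate(text, lang) -> text rendered in `lang`
#   judge(lang, variant, peer_feedback) -> (verdict, rationale)
Translate = Callable[[str, str], str]
Judge = Callable[[str, str, Optional[str]], Tuple[str, str]]


def semj_style_judge(item: str, languages: List[str], translate: Translate,
                     judge: Judge, max_rounds: int = 3) -> Tuple[str, int]:
    """Return (final verdict, number of reflection rounds used)."""
    # Step 1: construct one variant of the input per language.
    variants: Dict[str, str] = {lang: translate(item, lang) for lang in languages}
    # Step 2: collect independent judgments and rationales (no peer feedback yet).
    results = {lang: judge(lang, variants[lang], None) for lang in languages}

    rounds = 0
    # Step 3: while verdicts disagree, feed all verdicts and rationales back
    # to every judge for self-reflection and re-evaluation.
    while len({verdict for verdict, _ in results.values()}) > 1 and rounds < max_rounds:
        feedback = "\n".join(
            f"[{lang}] verdict={verdict}; rationale={rationale}"
            for lang, (verdict, rationale) in sorted(results.items())
        )
        results = {lang: judge(lang, variants[lang], feedback) for lang in languages}
        rounds += 1

    # Assumed fallback if disagreement survives max_rounds: most common verdict.
    final = Counter(verdict for verdict, _ in results.values()).most_common(1)[0][0]
    return final, rounds


# Toy stand-ins for real model calls, for illustration only.
def toy_translate(text: str, lang: str) -> str:
    return f"[{lang}] {text}"


def toy_judge(lang: str, variant: str, feedback: Optional[str]) -> Tuple[str, str]:
    if feedback is None:
        # First pass: the English judge disagrees with the others.
        return ("A" if lang == "en" else "B", f"first pass in {lang}")
    # After seeing the other rationales, every judge settles on "B".
    return ("B", "revised after reading peer rationales")
```

With the toy judge, `semj_style_judge("Which answer is better?", ["en", "de", "zh"], toy_translate, toy_judge)` needs one reflection round before all three languages agree. The point of the design is that disagreement acts as the trigger for extra computation, so items on which all languages already agree cost only the initial pass.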

📝 Abstract

Multilingual LLM-as-a-judge is widely used to evaluate model outputs across languages, but it suffers from cross-lingual inconsistency (Fu and Liu, 2025). Existing methods typically treat this inconsistency as noise and mitigate it through voting or aggregation. In this work, we instead show that multilingual inconsistency can provide complementary evaluation signals. Our oracle analysis finds that sampling judgments across languages yields a higher performance upper bound than single-language judging, suggesting that judgments made in different languages are complementary. Motivated by this finding, we propose SEMJ, a self-evolving multilingual judge that leverages cross-lingual inconsistency for iterative refinement. SEMJ constructs multilingual variants of each input, collects independent judgments and rationales, and feeds inconsistent outputs back for self-reflection and re-evaluation. Experiments on multiple benchmarks show that SEMJ consistently outperforms voting and reflection baselines in both accuracy and cross-lingual consistency. Further analysis shows that inconsistency triggers useful re-evaluation, which improves judgment quality.
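
To make the oracle analysis in the abstract concrete, the sketch below computes one plausible version of it: an item counts as solved by the oracle if the judgment in at least one language matches the gold label. The abstract does not spell out the exact oracle definition, so this formulation is an assumption, and the labels and predictions are made-up toy data rather than results from the paper.

```python
from typing import Dict, List


def single_language_accuracy(preds: Dict[str, List[str]],
                             gold: List[str]) -> Dict[str, float]:
    """Accuracy of each language's judgments taken on their own."""
    return {
        lang: sum(p == g for p, g in zip(lang_preds, gold)) / len(gold)
        for lang, lang_preds in preds.items()
    }


def oracle_accuracy(preds: Dict[str, List[str]], gold: List[str]) -> float:
    """Upper bound: an item is a hit if ANY language judged it correctly."""
    hits = sum(
        any(preds[lang][i] == label for lang in preds)
        for i, label in enumerate(gold)
    )
    return hits / len(gold)


# Made-up toy data: four items, two judging languages.
gold = ["A", "B", "A", "B"]
preds = {
    "en": ["A", "B", "B", "A"],  # correct on items 0 and 1
    "de": ["B", "B", "A", "A"],  # correct on items 1 and 2
}

per_language = single_language_accuracy(preds, gold)  # 0.5 for each language
upper_bound = oracle_accuracy(preds, gold)            # 0.75
```

In this toy case each language alone reaches 0.5, while the oracle reaches 0.75 because the two languages are correct on different items. A gap of this kind is what would indicate complementary rather than redundant judgments.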

Problem

Research questions and friction points this paper is trying to address.

cross-lingual inconsistency

multilingual LLM-as-a-judge

evaluation reliability

judgment disagreement

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-lingual inconsistency

self-evolving judge

multilingual LLM-as-a-judge

complementary evaluation signals

iterative refinement

🔎 Similar Papers

Adversarial Multi-Agent Evaluation of Large Language Models through Iterative Debates

2024-10-07 · arXiv.org · Citations: 11