🤖 AI Summary
Multilingual evaluation of large language models (LLMs) is hindered by reliance on costly human annotation and the difficulty of quantifying open-ended generation tasks. Method: This paper proposes a "translate-then-evaluate" framework: identical queries are translated across languages and fed into the LLM; semantic similarity and empathy scoring models then quantify cross-lingual consistency along two decoupled dimensions, information preservation and empathetic expression. Contribution/Results: The framework enables annotation-free, disentangled, and quantifiable assessment of multilingual consistency, supporting 30 languages spanning major language families and writing systems, with task-agnostic applicability and low computational cost. Experiments reveal significant consistency degradation in Slavic languages, Arabic-script languages, and low-resource languages; for certain language pairs, information-preservation errors exceed 40%, exposing fundamental limitations in current LLMs' multilingual capabilities.
📝 Abstract
Large language models (LLMs) provide detailed and impressive responses to queries in English. However, are they really consistent when responding to the same query in other languages? The prevailing approach to evaluating the multilingual performance of LLMs requires expensive-to-collect annotated datasets. Further, evaluating tasks like open-ended generation, where multiple correct answers may exist, is nontrivial. Instead, we propose to evaluate the predictability of model responses across different languages. In this work, we propose a framework to evaluate LLMs' cross-lingual consistency based on a simple Translate then Evaluate strategy. We instantiate this evaluation framework along two dimensions of consistency: information and empathy. Our results reveal pronounced inconsistencies in popular LLM responses across thirty languages, with severe performance deficits in certain language families and scripts, underscoring critical weaknesses in their multilingual capabilities. These findings motivate cross-lingual evaluations that measure consistency along multiple dimensions. We invite practitioners to use our framework for future multilingual LLM benchmarking.
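To make the Translate then Evaluate idea concrete, the information-preservation dimension can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names `translate`, `llm_respond`, and `embed` are hypothetical placeholders for a machine-translation system, the LLM under evaluation, and a semantic-similarity embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def consistency_scores(query_en, target_langs, llm_respond, translate, embed):
    """Information-preservation consistency sketch.

    For each target language: translate the English query, get the LLM's
    response in that language, translate the response back to English,
    and compare it against the English response via embedding similarity.
    All callables are assumed, caller-supplied components.
    """
    ref_embedding = embed(llm_respond(query_en))
    scores = {}
    for lang in target_langs:
        response = llm_respond(translate(query_en, lang))
        back_translated = translate(response, "en")
        scores[lang] = cosine(ref_embedding, embed(back_translated))
    return scores
```

A score near 1.0 for a language would indicate that the model's response there preserves the information of its English response; lower scores flag the kind of cross-lingual degradation the paper reports.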