🤖 AI Summary
This work addresses positional bias in translation evaluation, a distortion of LLM-based quality judgments arising from the order in which candidate translations are presented. We propose a fine-grained, multi-dimensional evaluation and ranking framework grounded in LLM reasoning. Methodologically, we construct structured prompts from a curated subset of the MQM (Multidimensional Quality Metrics) guidelines to elicit dimension-specific scores (e.g., accuracy, fluency, faithfulness) and holistic rankings from models including Claude-3.5-Sonnet and Qwen-2.5-72B-Instruct. Key contributions are: (1) systematic identification and explicit handling of positional bias in LLM-based translation assessment, using reasoning to disentangle intrinsic translation quality from positional artifacts; and (2) strong cross-lingual generalization, with performance matching or exceeding the state-of-the-art MT-Ranker on English-Japanese data and multiple WMT benchmarks, and Spearman correlations ρ > 0.92 against human judgments. The code, datasets, and evaluation protocol are publicly released.
📝 Abstract
We present TransEvalnia, a prompting-based translation evaluation and ranking system that uses reasoning in performing its evaluations and rankings. The system presents fine-grained evaluations based on a subset of the Multidimensional Quality Metrics (https://themqm.org/), reports which translation it deems best, and provides numerical scores for the individual dimensions and for the overall translation. We show that TransEvalnia performs as well as or better than the state-of-the-art MT-Ranker (Moosa et al. 2024) on our own English-Japanese data as well as on several language pairs from various WMT shared tasks. Using Anthropic's Claude-3.5-Sonnet and Qwen-2.5-72B-Instruct as the evaluation LLMs, we show that the evaluations returned are deemed highly acceptable by human raters, and that the scores assigned to the translations by Sonnet, as well as by other LLMs, correlate well with the human raters' scores. We also note the sensitivity of our system, as well as of MT-Ranker, to the order in which the translations are presented, and we propose methods to address this position bias. All data, including the system's evaluations and reasoning and the human assessments, as well as the code, are released.
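The position bias noted above can be illustrated with a minimal sketch. This is not the paper's actual mitigation method, whose details are not given here; it shows one standard debiasing strategy as an assumption: query the judge with both presentation orders and average each candidate's scores, so that any first-position bonus is applied to each candidate exactly once. The `score_pair` function below is a hypothetical stand-in for an LLM judge.

```python
import statistics

def score_pair(first: str, second: str) -> tuple[float, float]:
    """Hypothetical stand-in for an LLM judge scoring two translations
    in the order presented. A real system would prompt a model such as
    Claude-3.5-Sonnet; here we simulate a judge with a mild position
    bias that inflates whichever candidate appears first."""
    true_quality = {"t1": 4.0, "t2": 3.5}  # made-up intrinsic scores
    first_slot_bonus = 0.3                 # artificial position bias
    return true_quality[first] + first_slot_bonus, true_quality[second]

def debiased_scores(c1: str, c2: str) -> tuple[float, float]:
    """Evaluate both presentation orders and average each candidate's
    two scores, cancelling the first-position bonus."""
    a_first, b_first = score_pair(c1, c2)   # c1 presented first
    b_swap, a_swap = score_pair(c2, c1)     # c2 presented first
    return (statistics.mean([a_first, a_swap]),
            statistics.mean([b_first, b_swap]))

s1, s2 = debiased_scores("t1", "t2")
print(s1, s2)  # the score gap now equals the true 0.5 quality gap
```

With a single-order query the judge's 0.3 first-slot bonus shrinks or widens the apparent gap depending on ordering; averaging over both orders restores the underlying 0.5 difference.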