Automatic Evaluation Metrics for Document-level Translation: Overview, Challenges and Trends

📅 2025-04-21
🤖 AI Summary
This work addresses three critical challenges in the automatic evaluation of document-level machine translation (DMT): insufficient reference diversity; overreliance on sentence-level alignment; and the bias, unreliability, and lack of interpretability of large language model (LLM)-based judging. We systematically survey the evolution of DMT evaluation paradigms and propose, for the first time, a multidimensional taxonomy spanning reference-free vs. reference-based methods and traditional vs. model-driven vs. LLM-as-a-judge approaches. Methodologically, we advocate reducing dependence on sentence-level alignment, introducing discourse-level multi-granularity assessment, and training dedicated MT evaluation models. Through a cross-paradigm comparative analysis integrating BLEU, TER, BERTScore, COMET, and LLM-based judgment, we empirically delineate the performance boundaries and contextual applicability of each method. Our contributions include a theoretically grounded, interpretable, and user-friendly framework for robust DMT evaluation, accompanied by actionable guidelines for practical implementation.
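
To make the cross-paradigm comparison concrete, here is a minimal sketch of scoring the same toy document with BLEU and TER via the sacrebleu library. The library API is real; the toy data and the concatenate-sentences workaround for document-level scoring are illustrative assumptions, not the paper's exact protocol.

```python
# A minimal sketch comparing sentence-aligned and document-as-string scoring
# with sacrebleu (a real library); the toy data and the concatenation
# workaround are illustrative assumptions, not the paper's exact protocol.
from sacrebleu.metrics import BLEU, TER

# Toy two-sentence document: hypothesis vs. a single reference.
hyp_sents = ["He dropped the cup.", "Then it shattered."]
ref_sents = ["He dropped the cup.", "It shattered afterwards."]

bleu, ter = BLEU(), TER()

# Standard corpus-level scoring, which presumes a reliable 1:1 sentence
# alignment between hypothesis and reference -- the dependency the paper
# argues document-level evaluation should reduce.
print("sentence-aligned BLEU:", bleu.corpus_score(hyp_sents, [ref_sents]).score)
print("sentence-aligned TER: ", ter.corpus_score(hyp_sents, [ref_sents]).score)

# One crude alignment-free workaround: treat each whole document as a single
# segment. N-gram matches may then cross sentence boundaries, but discourse
# phenomena (pronouns, cohesion) are still not rewarded explicitly.
hyp_doc = [" ".join(hyp_sents)]
ref_doc = [" ".join(ref_sents)]
print("document-as-string BLEU:", bleu.corpus_score(hyp_doc, [ref_doc]).score)
```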

📝 Abstract
With the rapid development of deep learning technologies, the field of machine translation has witnessed significant progress, especially with the advent of large language models (LLMs), which have greatly propelled the advancement of document-level translation. However, accurately evaluating the quality of document-level translation remains an urgent issue. This paper first introduces the development status of document-level translation and the importance of evaluation, highlighting the crucial role of automatic evaluation metrics in reflecting translation quality and guiding the improvement of translation systems. It then provides a detailed analysis of the current state of automatic evaluation schemes and metrics, including reference-based and reference-free evaluation methods, as well as traditional, model-based, and LLM-based metrics. Subsequently, the paper explores the challenges faced by current evaluation methods, such as the lack of reference diversity, dependence on sentence-level alignment information, and the bias, inaccuracy, and lack of interpretability of the LLM-as-a-judge method. Finally, the paper looks ahead to future trends in evaluation methods, including the development of more user-friendly document-level evaluation methods and more robust LLM-as-a-judge methods, and proposes possible research directions, such as reducing the dependency on sentence-level information, introducing multi-level and multi-granular evaluation approaches, and training models specifically for machine translation evaluation. This study aims to provide a comprehensive analysis of automatic evaluation for document-level translation and offer insights into future developments.
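
The reference-free, LLM-as-a-judge paradigm the abstract discusses typically boils down to prompting a strong model with a rubric. Below is a minimal sketch: the rubric, the 1-10 scale, and the judge model name are illustrative assumptions rather than the paper's protocol; only the standard OpenAI chat-completions call is taken as given.

```python
# A minimal sketch of reference-free LLM-as-a-judge scoring for a document
# pair. The rubric, 1-10 scale, and model name are illustrative assumptions;
# the chat-completions API itself is the standard openai>=1.0 client.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_document(source_doc: str, translation_doc: str) -> str:
    """Ask an LLM judge for a document-level quality score plus rationale."""
    prompt = (
        "You are a professional translation evaluator.\n"
        "Rate the TRANSLATION of the SOURCE document from 1 (unusable) to "
        "10 (publication-ready), considering accuracy, fluency, and "
        "document-level cohesion (pronoun consistency, terminology, "
        "discourse connectives).\n"
        "Reply with the score, then a one-sentence justification.\n\n"
        f"SOURCE:\n{source_doc}\n\nTRANSLATION:\n{translation_doc}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge; any capable chat model works
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # deterministic judging reduces run-to-run variance
    )
    return resp.choices[0].message.content
```

Asking for a justification alongside the score is one lightweight way to address the interpretability gap the paper highlights, though it does not by itself remove judge bias.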
Problem

Research questions and friction points this paper is trying to address.

Evaluating document-level translation quality accurately
Challenges in current automatic evaluation metrics
Future trends in document-level translation evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes large language models for translation evaluation
Explores multi-level and multi-granular evaluation approaches
Proposes training models specifically for translation evaluation (see the COMET sketch after this list)
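
COMET is the best-known instance of a model trained specifically for MT evaluation, and the unbabel-comet library exposes it in a few lines. Note the caveat baked into this sketch: the wmt22-comet-da checkpoint is sentence-trained, so passing a whole document as one segment (done here purely for illustration) inherits exactly the granularity mismatch the paper critiques.

```python
# A minimal sketch of scoring with a trained evaluation model via the
# unbabel-comet library (pip install unbabel-comet). The checkpoint is a
# real public model; feeding an entire document as a single "segment" is an
# illustrative assumption, since the model was trained on sentence pairs.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "Er ließ die Tasse fallen. Dann zerbrach sie.",
        "mt":  "He dropped the cup. Then it shattered.",
        "ref": "He dropped the cup. It shattered afterwards.",
    }
]

# gpus=0 forces CPU inference; the output carries per-segment scores and an
# aggregated system-level score.
output = model.predict(data, batch_size=8, gpus=0)
print("segment scores:", output.scores)
print("system score:  ", output.system_score)
```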
Jiaxin GUO
BA, XJTU; MA, HK CityU; Researcher, Huawei 2012 Lab
Large Language Model · Machine Translation · Natural Language Processing
Xiaoyu Chen
Huawei Translation Services Center, Beijing, China
Zhiqiang Rao
Huawei
NLP
Jinlong Yang
Huawei Translation Services Center, Beijing, China
Zongyao Li
Huawei Translation Services Center, Beijing, China
Hengchao Shang
Huawei Translation Services Center, Beijing, China
Daimeng Wei
Huawei Translation Services Center, Beijing, China
Hao Yang
Huawei Translation Services Center, Beijing, China