🤖 AI Summary
This study addresses the data privacy risks and high costs of current machine translation quality estimation methods, which rely on large, closed-source language models. The authors propose a single-pass prompting strategy in which open-source large language models with fewer than 30 billion parameters simultaneously generate quality scores, MQM error annotations, suggested corrections, and fully post-edited translations. The reported results show system-level correlations with human judgments that exceed those of conventional neural metrics and fine-tuned models, as well as human inter-annotator agreement, while approximating the capabilities of much larger closed-source models. The approach keeps data under the user's control and is presented as a cost-effective, interpretable alternative for translation quality assessment.
📝 Abstract
Current state-of-the-art Quality Estimation (QE) in machine translation relies on massive, proprietary LLMs, raising data privacy concerns. We demonstrate that smaller, open-source LLMs (<30B parameters) are a viable, cost-effective, and privacy-preserving alternative. Using a single-pass prompting strategy, our models simultaneously generate quality scores, Multidimensional Quality Metrics (MQM) error annotations, suggested error corrections, and full post-editions. Our analysis shows that these models achieve highly competitive system-level correlations with human judgments, exceeding those of traditional neural metrics and fine-tuned models as well as human inter-annotator agreement, and effectively approximating the capabilities of much larger proprietary LLMs.
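To make the single-pass idea concrete, the following is a minimal sketch of how one prompt could request all four outputs at once and how the reply could be parsed. It is an illustration under stated assumptions, not the authors' implementation: the prompt wording, the JSON keys (`score`, `errors`, `post_edit`), the 0 to 100 score range, and the names `build_prompt`, `parse_response`, `MqmError`, and `QeResult` are all hypothetical. The call to an actual model is deliberately left out, since the abstract does not specify a model or serving setup.

```python
import json
from dataclasses import dataclass
from typing import List

# Hypothetical prompt template: the paper's actual prompt is not given in the
# abstract, so this wording and the requested JSON keys are assumptions.
PROMPT_TEMPLATE = """You are a translation quality evaluator.
Source ({src_lang}): {source}
Translation ({tgt_lang}): {translation}

Reply with one JSON object containing exactly these keys:
  "score": a number from 0 to 100 (assumed scale),
  "errors": a list of objects with keys "span", "category", "severity", "correction",
  "post_edit": the fully corrected translation.
"""


@dataclass
class MqmError:
    span: str        # erroneous text in the translation
    category: str    # MQM-style category label, e.g. "accuracy/mistranslation"
    severity: str    # e.g. "minor", "major", "critical"
    correction: str  # suggested replacement for the span


@dataclass
class QeResult:
    score: float
    errors: List[MqmError]
    post_edit: str


def build_prompt(source: str, translation: str, src_lang: str, tgt_lang: str) -> str:
    """Fill the single prompt that asks for all four outputs in one pass."""
    return PROMPT_TEMPLATE.format(
        source=source, translation=translation, src_lang=src_lang, tgt_lang=tgt_lang
    )


def parse_response(raw: str) -> QeResult:
    """Parse a model reply into a QeResult.

    Models often surround JSON with extra text, so only the span from the
    first '{' to the last '}' is decoded.
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(raw[start : end + 1])

    score = float(data["score"])
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"score out of assumed range: {score}")

    errors = [
        MqmError(
            span=str(e["span"]),
            category=str(e["category"]),
            severity=str(e["severity"]),
            correction=str(e["correction"]),
        )
        for e in data.get("errors", [])
    ]
    return QeResult(score=score, errors=errors, post_edit=str(data["post_edit"]))


if __name__ == "__main__":
    prompt = build_prompt("Der Hund schläft.", "The dog sleep.", "German", "English")
    # Hand-written reply for illustration only; this is not real model output.
    fake_reply = (
        'Here is my assessment: {"score": 70, "errors": [{"span": "sleep", '
        '"category": "fluency/grammar", "severity": "minor", '
        '"correction": "sleeps"}], "post_edit": "The dog sleeps."}'
    )
    result = parse_response(fake_reply)
    print(result.score, len(result.errors), result.post_edit)
```

Requesting a single structured object keeps the score, the error annotations, the corrections, and the post-edit tied to the same model pass, which is what makes the output easy to inspect; the trade-off is that the parser must tolerate malformed or wrapped JSON from smaller models.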