Calibrating Translation Decoding with Quality Estimation on LLMs

📅 2025-04-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
In neural machine translation (NMT), maximum a posteriori (MAP) decoding often yields low-quality translations due to misalignment between the decoding objective (maximizing token-level likelihood) and actual translation quality. This paper proposes a quality-aware probabilistic calibration framework that incorporates translation quality estimation directly into decoding-probability calibration, optimizing the Pearson correlation between candidate translation log-likelihoods and human- or model-based quality scores. To the authors' knowledge, this is the first work to embed quality estimation explicitly in likelihood calibration. The method is lightweight (requiring only ~2K samples per language pair), covers ten languages, and needs no additional annotations or model fine-tuning. Calibrated likelihoods serve as strong unsupervised quality estimation (QE) proxies, with gains orthogonal to supervised fine-tuning and preference optimization. Experiments demonstrate significant improvements over strong baselines, including Tower and CPO, across multilingual automatic metrics and human evaluation; moreover, calibrated likelihoods match or surpass state-of-the-art QE models (e.g., CometKiwi) in quality-prediction performance.
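The core objective described above can be illustrated with a minimal sketch: compute the Pearson correlation between a set of candidate log-likelihoods and their quality scores, and minimize one minus that correlation. This is an assumption-laden illustration, not the paper's implementation: the function names are hypothetical, and in practice the objective would be computed over batches of sampled candidates with a differentiable framework so the model's likelihoods can be updated.

```python
import math

def pearson_corr(xs, ys):
    # Pearson correlation between two equal-length lists of scores.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def calibration_loss(log_likelihoods, quality_scores):
    # Hypothetical training signal: maximize correlation between candidate
    # log-likelihoods and quality scores, i.e. minimize 1 - r.
    # Loss is 0 when likelihoods rank candidates exactly as quality does.
    return 1.0 - pearson_corr(log_likelihoods, quality_scores)
```

For example, candidates whose log-likelihoods already track their quality scores yield a loss near 0, while anti-correlated likelihoods yield a loss near 2.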

📝 Abstract
Neural machine translation (NMT) systems typically employ maximum a posteriori (MAP) decoding to select the highest-scoring translation from the distribution mass. However, recent evidence highlights the inadequacy of MAP decoding, often resulting in low-quality or even pathological hypotheses -- the decoding objective is not aligned with real-world translation quality. This paper proposes calibrating hypothesis likelihoods with translation quality from a distribution view by directly optimizing their Pearson correlation -- thereby enhancing the effectiveness of translation decoding. With our method, translation on large language models (LLMs) improves substantially after limited training (2K instances per direction). This improvement is orthogonal to those achieved through supervised fine-tuning, leading to substantial gains across a broad range of metrics and human evaluations -- even when applied to top-performing translation-specialized LLMs fine-tuned on high-quality translation data, such as Tower, or when compared to recent preference optimization methods, like CPO. Moreover, the calibrated translation likelihood can directly serve as a strong proxy for translation quality, closely approximating or even surpassing some state-of-the-art translation quality estimation models, like CometKiwi. Lastly, our in-depth analysis demonstrates that calibration enhances the effectiveness of MAP decoding, thereby enabling greater efficiency in real-world deployment. The resulting state-of-the-art translation model, which covers 10 languages, along with the accompanying code and human evaluation data, has been released to the community: https://github.com/moore3930/calibrating-llm-mt.
Problem

Research questions and friction points this paper is trying to address.

Aligning decoding objective with real-world translation quality
Improving translation decoding effectiveness via likelihood calibration
Enhancing MAP decoding efficiency for practical deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes Pearson correlation between hypothesis likelihoods and translation quality scores
Calibrated likelihoods strengthen MAP decoding and double as an unsupervised quality-estimation proxy
Achieves substantial LLM gains with limited training (2K instances per direction)