CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation

📅 2025-01-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address optimization bottlenecks in non-English machine translation caused by pretraining biases and low-quality preference data in large language models (LLMs), this paper proposes Confidence-Reward driven Preference Optimization (CRPO). CRPO is the first method to incorporate model confidence, measured via output entropy and logit margin, into the DPO framework, enabling dynamic identification of high-uncertainty or low-performing sentence pairs for constructing high-quality, challenging preference samples. It applies to both LLMs and encoder-decoder models (e.g., NLLB). By jointly weighting sample selection and gradient updates with confidence scores and multi-dimensional reward signals, CRPO achieves state-of-the-art performance across multilingual translation tasks, significantly outperforming baselines such as RS-DPO, RSO, and MBR in BLEU and COMET scores while making more efficient use of labeled data.
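The summary above names two confidence signals, output entropy and logit margin. As a minimal sketch (assuming PyTorch-style per-token logits; the function name and the mean-pooled aggregation are hypothetical, not the paper's implementation), they can be computed like this:

```python
import torch
import torch.nn.functional as F

def confidence_signals(logits: torch.Tensor) -> tuple[float, float]:
    """Estimate confidence on one generated translation.

    logits: (seq_len, vocab_size) scores over the vocabulary at each step.
    Returns (mean token entropy, mean top-1/top-2 logit margin).
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()

    # Output entropy: a flat next-token distribution (high entropy)
    # signals low confidence in the produced translation.
    entropy = -(probs * log_probs).sum(dim=-1).mean().item()

    # Logit margin: gap between the top-1 and top-2 logits at each step;
    # a small margin means the model is torn between candidates.
    top2 = logits.topk(2, dim=-1).values
    margin = (top2[:, 0] - top2[:, 1]).mean().item()

    return entropy, margin
```

High entropy or a small margin flags sentence pairs the model is uncertain about, which is the signal CRPO reportedly uses to pick challenging preference samples.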

📝 Abstract
Large language models (LLMs) have shown great potential in natural language processing tasks, but their application to machine translation (MT) remains challenging due to pretraining on English-centric data and the complexity of reinforcement learning from human feedback (RLHF). Direct Preference Optimization (DPO) has emerged as a simpler and more efficient alternative, but its performance depends heavily on the quality of preference data. To address this, we propose Confidence-Reward driven Preference Optimization (CRPO), a novel method that combines reward scores with model confidence to improve data selection for fine-tuning. CRPO selects challenging sentence pairs where the model is uncertain or underperforms, leading to more effective learning. While primarily designed for LLMs, CRPO also generalizes to encoder-decoder models like NLLB, demonstrating its versatility. Empirical results show that CRPO outperforms existing methods such as RS-DPO, RSO, and MBR scoring in both translation accuracy and data efficiency.
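To make the data-selection idea concrete, here is a hypothetical sketch of confidence-reward driven pair selection: rank candidate pairs so that large reward gaps combined with low model confidence score highest, then keep the top-k for DPO fine-tuning. The scoring rule and the `alpha` trade-off weight are illustrative assumptions, not the paper's actual criterion:

```python
from dataclasses import dataclass

@dataclass
class CandidatePair:
    chosen: str        # higher-reward translation
    rejected: str      # lower-reward translation
    reward_gap: float  # reward(chosen) - reward(rejected)
    confidence: float  # model confidence, e.g. from entropy/margin

def select_pairs(pairs: list[CandidatePair], k: int,
                 alpha: float = 1.0) -> list[CandidatePair]:
    """Keep the k most 'challenging' pairs: informative reward gap,
    low model confidence (the model is uncertain or underperforms)."""
    scored = sorted(pairs,
                    key=lambda p: p.reward_gap - alpha * p.confidence,
                    reverse=True)
    return scored[:k]
```

The selected pairs would then feed a standard DPO objective; the point of the sketch is only that selection scores both how informative a pair is (reward) and how unsure the model is about it (confidence).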
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Translation Tasks
Direct Preference Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

CRPO method
reward mechanism
model confidence
Guofeng Cui
Rutgers University
Machine Learning · Symbolic Reasoning
Pichao Wang
Amazon
Yang Liu
Amazon
Zemian Ke
Amazon
Zhu Liu
Amazon
Vimal Bhat
Amazon