🤖 AI Summary
To address optimization bottlenecks in non-English machine translation caused by English-centric pretraining biases and low-quality preference data in large language models (LLMs), this paper proposes Confidence-Reward driven Preference Optimization (CRPO). CRPO incorporates model confidence—measured via output entropy and logit margin—into the DPO framework, dynamically identifying sentence pairs on which the model is uncertain or underperforms in order to construct high-quality, challenging preference samples. Although designed primarily for LLMs, the method also applies to encoder-decoder models such as NLLB. By jointly weighting sample selection and gradient updates with confidence scores and multi-dimensional reward signals, CRPO achieves strong performance across multilingual translation tasks, outperforming baselines including RS-DPO, RSO, and MBR score in both BLEU and COMET while using labeled preference data more efficiently.
📝 Abstract
Large language models (LLMs) have shown great potential in natural language processing tasks, but their application to machine translation (MT) remains challenging due to pretraining on English-centric data and the complexity of reinforcement learning from human feedback (RLHF). Direct Preference Optimization (DPO) has emerged as a simpler and more efficient alternative, but its performance depends heavily on the quality of preference data. To address this, we propose Confidence-Reward driven Preference Optimization (CRPO), a novel method that combines reward scores with model confidence to improve data selection for fine-tuning. CRPO selects challenging sentence pairs where the model is uncertain or underperforms, leading to more effective learning. While primarily designed for LLMs, CRPO also generalizes to encoder-decoder models like NLLB, demonstrating its versatility. Empirical results show that CRPO outperforms existing methods such as RS-DPO, RSO, and MBR score in both translation accuracy and data efficiency.
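The core idea—scoring candidate preference pairs by combining a reward signal with the model's own confidence, then keeping the pairs where the model is uncertain or underperforms—can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the names `reward_gap`, `confidence`, and the linear mixing weight `alpha` are assumptions introduced here for clarity.

```python
# Illustrative sketch of confidence-reward data selection for DPO-style
# fine-tuning. All field names and the scoring formula are hypothetical,
# chosen to mirror the idea described in the abstract, not CRPO's exact math.
from dataclasses import dataclass


@dataclass
class PreferencePair:
    chosen: str          # preferred translation
    rejected: str        # dispreferred translation
    reward_gap: float    # reward(chosen) - reward(rejected), from a reward model
    confidence: float    # model's confidence that chosen beats rejected, in [0, 1]


def selection_score(pair: PreferencePair, alpha: float = 0.5) -> float:
    """Score a pair's value for fine-tuning: a pair is more informative when
    the reward gap is large (a clear quality signal) but the model's own
    confidence is low (it is uncertain or wrong, so there is something to learn)."""
    return alpha * pair.reward_gap + (1.0 - alpha) * (1.0 - pair.confidence)


def select_top_pairs(pairs: list, k: int, alpha: float = 0.5) -> list:
    """Keep the k highest-scoring (most informative) pairs for fine-tuning."""
    return sorted(pairs, key=lambda p: selection_score(p, alpha), reverse=True)[:k]
```

In this sketch, a pair with a large reward gap but low model confidence scores highest, matching the abstract's intuition that "challenging" pairs—where the model is uncertain or underperforms—drive the most effective learning.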