Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of non-English preference data in multilingual preference alignment, this paper proposes an implicit cross-lingual reward mechanism. Unlike prior approaches, it requires neither explicit multilingual reward models nor human annotations; instead, it implicitly distills reward signals from an English DPO-aligned model via logit-difference analysis, leverages English instructions to evaluate multilingual responses, and autonomously generates high-quality cross-lingual preference data to drive iterative multilingual DPO fine-tuning. This enables zero-shot transfer of English preference knowledge to diverse languages. Applied to Llama3 with only two rounds of fine-tuning, our method improves average win rates on X-AlpacaEval by 12.72% and length-control win rates by 5.97%, substantially reducing reliance on multilingual preference data. To our knowledge, this is the first work to introduce implicit reward modeling into cross-lingual alignment, establishing an efficient and scalable paradigm for preference alignment in low-resource languages.

📝 Abstract
Direct Preference Optimization (DPO) has become a prominent method for aligning Large Language Models (LLMs) with human preferences. While DPO has enabled significant progress in aligning English LLMs, multilingual preference alignment is hampered by data scarcity. To address this, we propose a novel approach that *captures* learned preferences from well-aligned English models by implicit rewards and *transfers* them to other languages through iterative training. Specifically, we derive an implicit reward model from the logits of an English DPO-aligned model and its corresponding reference model. This reward model is then leveraged to annotate preference relations in cross-lingual instruction-following pairs, using English instructions to evaluate multilingual responses. The annotated data is subsequently used for multilingual DPO fine-tuning, facilitating preference knowledge transfer from English to other languages. Fine-tuning Llama3 for two iterations resulted in a 12.72% average improvement in Win Rate and a 5.97% increase in Length Control Win Rate across all training languages on the X-AlpacaEval leaderboard. Our findings demonstrate that leveraging existing English-aligned models can enable efficient and effective multilingual preference alignment, significantly reducing the need for extensive multilingual preference data. The code is available at https://github.com/ZNLP/Implicit-Cross-Lingual-Rewarding.
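The implicit reward the abstract describes is the standard DPO identity: the reward of a response is proportional to the log-probability ratio between the DPO-aligned model and its reference model. A minimal sketch of how such a reward could annotate a cross-lingual preference pair is below; the function names and the per-token log-prob lists standing in for model outputs are illustrative assumptions, not the paper's actual implementation:

```python
def sequence_logprob(token_logprobs):
    """Sum per-token log-probabilities of a response under one model
    (here supplied directly as a list; a real system would score with an LLM)."""
    return sum(token_logprobs)

def implicit_reward(dpo_logprobs, ref_logprobs, beta=0.1):
    """DPO-style implicit reward: beta * (log pi_dpo(y|x) - log pi_ref(y|x))."""
    return beta * (sequence_logprob(dpo_logprobs) - sequence_logprob(ref_logprobs))

def annotate_pair(resp_a, resp_b, beta=0.1):
    """Label the higher-reward response 'chosen' to build DPO training data.
    Each resp_* is a (dpo_logprobs, ref_logprobs) pair for one candidate."""
    reward_a = implicit_reward(*resp_a, beta=beta)
    reward_b = implicit_reward(*resp_b, beta=beta)
    return ("a", "b") if reward_a >= reward_b else ("b", "a")

# Toy per-token log-probs for two candidate multilingual responses to an
# English instruction; response "a" is favored by the DPO-aligned model.
resp_a = ([-0.5, -0.8], [-1.2, -1.5])
resp_b = ([-2.0, -2.1], [-1.9, -2.0])
chosen, rejected = annotate_pair(resp_a, resp_b)  # → ("a", "b")
```

Pairs annotated this way would then feed another round of multilingual DPO fine-tuning, which is what the paper's iterative training loop repeats.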
Problem

Research questions and friction points this paper is trying to address.

Multilingual preference alignment hindered by data scarcity.
Transfer learned preferences from English to other languages.
Improve multilingual LLM performance with implicit cross-lingual rewarding.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Implicit rewards transfer preferences across languages
Iterative training enhances multilingual model alignment
English-aligned models reduce multilingual data needs
Wen Yang
School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Junhong Wu
PhD student, Institute of Automation, Chinese Academy of Sciences
Natural language processing; lifelong learning
Chen Wang
School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Chengqing Zong
School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Jiajun Zhang
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing; Large Language Models; Multimodal Information Processing