🤖 AI Summary
To address the scarcity of non-English preference data in multilingual preference alignment, this paper proposes an implicit cross-lingual reward mechanism. Unlike prior approaches, it requires neither explicit multilingual reward models nor human annotations; instead, it implicitly distills reward signals from an English DPO-aligned model via logit differences against its reference model, leverages English instructions to evaluate multilingual responses, and autonomously generates high-quality cross-lingual preference data to drive iterative multilingual DPO fine-tuning. This enables zero-shot transfer of English preference knowledge to diverse languages. Applied to Llama3 with only two rounds of fine-tuning, the method improves average win rates on X-AlpacaEval by 12.72% and length-control win rates by 5.97%, substantially reducing reliance on multilingual preference data. To the authors' knowledge, this is the first work to introduce implicit reward modeling into cross-lingual alignment, establishing an efficient and scalable paradigm for preference alignment in low-resource languages.
📝 Abstract
Direct Preference Optimization (DPO) has become a prominent method for aligning Large Language Models (LLMs) with human preferences. While DPO has enabled significant progress in aligning English LLMs, multilingual preference alignment is hampered by data scarcity. To address this, we propose a novel approach that *captures* learned preferences from well-aligned English models by implicit rewards and *transfers* them to other languages through iterative training. Specifically, we derive an implicit reward model from the logits of an English DPO-aligned model and its corresponding reference model. This reward model is then leveraged to annotate preference relations in cross-lingual instruction-following pairs, using English instructions to evaluate multilingual responses. The annotated data is subsequently used for multilingual DPO fine-tuning, facilitating preference knowledge transfer from English to other languages. Fine-tuning Llama3 for two iterations resulted in a 12.72% average improvement in Win Rate and a 5.97% increase in Length Control Win Rate across all training languages on the X-AlpacaEval leaderboard. Our findings demonstrate that leveraging existing English-aligned models can enable efficient and effective multilingual preference alignment, significantly reducing the need for extensive multilingual preference data. The code is available at https://github.com/ZNLP/Implicit-Cross-Lingual-Rewarding.
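The implicit reward described above follows the standard DPO identity: the reward of a response is proportional to the log-probability ratio between the aligned policy and its reference model. A minimal sketch of how such a reward could rank two candidate responses (all numbers and the `beta` value are illustrative assumptions, not values from the paper):

```python
def implicit_reward(policy_logps, ref_logps, beta=0.1):
    """DPO-style implicit reward for one response:
    beta * (sum of policy token log-probs - sum of reference token log-probs).
    Inputs are per-token log-probabilities; beta is a hypothetical scale."""
    return beta * (sum(policy_logps) - sum(ref_logps))

# Toy per-token log-probs for two candidate multilingual responses
# scored under an English instruction (hypothetical numbers).
reward_a = implicit_reward([-1.0, -0.5], [-1.2, -0.9])  # policy prefers A over ref
reward_b = implicit_reward([-2.0, -1.5], [-1.8, -1.4])  # policy disprefers B

# The higher-reward response is annotated "chosen", the other "rejected",
# yielding a preference pair for the next round of DPO fine-tuning.
chosen, rejected = ("A", "B") if reward_a > reward_b else ("B", "A")
```

In this sketch, response A receives a positive reward (the aligned model assigns it relatively more probability than the reference does) and is labeled "chosen"; iterating this annotate-then-train loop is what transfers the English preference signal to the other languages.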