Language Imbalance Driven Rewarding for Multilingual Self-improving

📅 2024-10-11
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit severe multilingual capability imbalance, with non-dominant languages significantly underperforming relative to dominant ones like English. Method: This paper proposes, for the first time, modeling cross-lingual performance gaps as a self-supervised reward signal to drive iterative, annotation-free, and parallel-corpus-free self-improvement. Built upon the DPO framework, the approach performs two rounds of preference alignment on Meta-Llama-3-8B-Instruct, evaluated jointly using X-AlpacaEval and MGSM. Contribution/Results: The method achieves a 7.46% average win-rate gain on X-AlpacaEval and a 13.9% absolute improvement in arithmetic reasoning accuracy on MGSM. Crucially, gains are observed across both dominant and non-dominant languages, demonstrating synchronous enhancement. The core innovation lies in transforming linguistic imbalance itself into an optimizable intrinsic reward, enabling cross-lingual co-improvement without external supervision.
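The pipeline sketched in the summary is annotation-free: the model's own dominant-language answer, self-translated back into the target language, is preferred over its direct answer in that language. Below is a minimal sketch of pair construction, assuming hypothetical `generate` and `translate` callables that wrap the model; the paper's exact prompting may differ.

```python
# Sketch of language-imbalance-driven preference pairs (not the authors' code).
# `generate` and `translate` are placeholders for the model's own generation
# and self-translation; no human labels or parallel corpora are involved.

def build_preference_pair(generate, translate, prompt, lang, dominant="English"):
    """One DPO pair for a prompt posed in non-dominant language `lang`."""
    prompt_dom = translate(prompt, src=lang, tgt=dominant)  # self-translate the prompt
    answer_dom = generate(prompt_dom)                       # stronger dominant-language answer
    chosen = translate(answer_dom, src=dominant, tgt=lang)  # self-translate it back
    rejected = generate(prompt)                             # weaker direct answer in `lang`
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Toy stand-ins so the sketch runs end to end.
fake_generate = lambda p: f"answer<{p}>"
fake_translate = lambda text, src, tgt: f"[{src}->{tgt}] {text}"
print(build_preference_pair(fake_generate, fake_translate, "¿Qué es la fotosíntesis?", "Spanish"))
```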

📝 Abstract
Large Language Models (LLMs) have achieved state-of-the-art performance across numerous tasks. However, these advancements have predominantly benefited "first-class" languages such as English and Chinese, leaving many other languages underrepresented. This imbalance, while limiting broader applications, generates a natural preference ranking between languages, offering an opportunity to bootstrap the multilingual capabilities of LLMs in a self-improving manner. Thus, we propose $\textit{Language Imbalance Driven Rewarding}$, where the inherent imbalance between dominant and non-dominant languages within LLMs is leveraged as a reward signal. Iterative DPO training demonstrates that this approach not only enhances LLM performance in non-dominant languages but also improves the dominant language's capacity, thereby yielding an iterative reward signal. Fine-tuning Meta-Llama-3-8B-Instruct over two iterations of this approach results in continuous improvements in multilingual performance across instruction-following and arithmetic reasoning tasks, evidenced by an average improvement of 7.46% win rate on the X-AlpacaEval leaderboard and 13.9% accuracy on the MGSM benchmark. This work serves as an initial exploration, paving the way for multilingual self-improvement of LLMs. The code is available at https://github.com/ZNLP/Language-Imbalance-Driven-Rewarding.
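For reference, the optimization itself is standard DPO; the paper-specific ingredient is only how the pairs are formed, with the dominant-language-derived response as the chosen $y_w$ and the direct non-dominant-language response as the rejected $y_l$:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right]$$

Here $\pi_{\text{ref}}$ is the model before the current round and $\beta$ controls how far the policy may deviate from it.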
Problem

Research questions and friction points this paper is trying to address.

Addresses language imbalance in multilingual LLMs
Enhances performance in non-dominant languages
Improves dominant language capabilities iteratively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language Imbalance Driven Rewarding
Iterative DPO training (see the loss sketch after this list)
Fine-tuning Meta-Llama-3-8B-Instruct
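As a minimal, framework-free sketch of the objective driving each round (assuming summed per-sequence log-probabilities are already computed; batch size and `beta` are illustrative, not the paper's settings):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pol_chosen_lp, pol_rejected_lp, ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard DPO loss over per-sequence log-probabilities.

    `chosen` = dominant-language-derived response, `rejected` = direct
    non-dominant-language response, per the paper's reward signal.
    """
    pol_ratio = pol_chosen_lp - pol_rejected_lp   # policy log-ratio, chosen vs. rejected
    ref_ratio = ref_chosen_lp - ref_rejected_lp   # reference-model log-ratio
    return -F.logsigmoid(beta * (pol_ratio - ref_ratio)).mean()

# Toy usage: a batch of 4 preference pairs with random log-probabilities.
torch.manual_seed(0)
print(float(dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))))
```

In the iterative setting, the model trained in round $t$ regenerates the preference pairs for round $t+1$; the paper runs two such rounds starting from Meta-Llama-3-8B-Instruct.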
Wen Yang
School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Junhong Wu
PhD student, Institute of Automation, Chinese Academy of Sciences
Natural language processing; lifelong learning
Chen Wang
School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Chengqing Zong
School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Jiajun Zhang
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing; Large Language Models; Multimodal Information Processing