Language Imbalance Driven Rewarding for Multilingual Self-improving

📅 2024-10-11
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit severe multilingual capability imbalance, with non-dominant languages significantly underperforming relative to dominant ones like English. Method: This paper proposes, for the first time, modeling cross-lingual performance gaps as a self-supervised reward signal to drive iterative, annotation-free, and parallel-corpus-free self-improvement. Built upon the DPO framework, the approach performs two rounds of preference alignment on Meta-Llama-3-8B-Instruct, evaluated jointly using X-AlpacaEval and MGSM. Contribution/Results: The method achieves a 7.46% average win-rate gain on X-AlpacaEval and a 13.9% absolute improvement in arithmetic reasoning accuracy on MGSM. Crucially, gains are observed across both dominant and non-dominant languages, demonstrating synchronous enhancement. The core innovation lies in transforming linguistic imbalance itself into an optimizable intrinsic reward, enabling cross-lingual co-improvement without external supervision.
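The pipeline sketched in the summary is annotation-free: the model's own dominant-language answer, self-translated back into the target language, is preferred over its direct answer in that language. Below is a minimal sketch of pair construction, assuming hypothetical `generate` and `translate` callables that wrap the model; the paper's exact prompting may differ.

```python
# Sketch of language-imbalance-driven preference pairs (not the authors' code).
# `generate` and `translate` are placeholders for the model's own generation
# and self-translation; no human labels or parallel corpora are involved.

def build_preference_pair(generate, translate, prompt, lang, dominant="English"):
    """One DPO pair for a prompt posed in non-dominant language `lang`."""
    prompt_dom = translate(prompt, src=lang, tgt=dominant)  # self-translate the prompt
    answer_dom = generate(prompt_dom)                       # stronger dominant-language answer
    chosen = translate(answer_dom, src=dominant, tgt=lang)  # self-translate it back
    rejected = generate(prompt)                             # weaker direct answer in `lang`
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Toy stand-ins so the sketch runs end to end.
fake_generate = lambda p: f"answer<{p}>"
fake_translate = lambda text, src, tgt: f"[{src}->{tgt}] {text}"
print(build_preference_pair(fake_generate, fake_translate, "¿Qué es la fotosíntesis?", "Spanish"))
```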

📝 Abstract
Large Language Models (LLMs) have achieved state-of-the-art performance across numerous tasks. However, these advancements have predominantly benefited "first-class" languages such as English and Chinese, leaving many other languages underrepresented. This imbalance, while limiting broader applications, generates a natural preference ranking between languages, offering an opportunity to bootstrap the multilingual capabilities of LLMs in a self-improving manner. Thus, we propose $\textit{Language Imbalance Driven Rewarding}$, where the inherent imbalance between dominant and non-dominant languages within LLMs is leveraged as a reward signal. Iterative DPO training demonstrates that this approach not only enhances LLM performance in non-dominant languages but also improves the dominant language's capacity, thereby yielding an iterative reward signal. Fine-tuning Meta-Llama-3-8B-Instruct over two iterations of this approach results in continuous improvements in multilingual performance across instruction-following and arithmetic reasoning tasks, evidenced by an average improvement of 7.46% win rate on the X-AlpacaEval leaderboard and 13.9% accuracy on the MGSM benchmark. This work serves as an initial exploration, paving the way for multilingual self-improvement of LLMs. The code is available at https://github.com/ZNLP/Language-Imbalance-Driven-Rewarding.
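For reference, the optimization itself is standard DPO; the paper-specific ingredient is only how the pairs are formed, with the dominant-language-derived response as the chosen $y_w$ and the direct non-dominant-language response as the rejected $y_l$:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right]$$

Here $\pi_{\text{ref}}$ is the model before the current round and $\beta$ controls how far the policy may deviate from it.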
Problem

Research questions and friction points this paper is trying to address.

Addresses language imbalance in multilingual LLMs
Enhances performance in non-dominant languages
Improves dominant language capabilities iteratively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language Imbalance Driven Rewarding
Iterative DPO training (see the loss sketch after this list)
Fine-tuning Meta-Llama-3-8B-Instruct
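As a minimal, framework-free sketch of the objective driving each round (assuming summed per-sequence log-probabilities are already computed; batch size and `beta` are illustrative, not the paper's settings):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pol_chosen_lp, pol_rejected_lp, ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard DPO loss over per-sequence log-probabilities.

    `chosen` = dominant-language-derived response, `rejected` = direct
    non-dominant-language response, per the paper's reward signal.
    """
    pol_ratio = pol_chosen_lp - pol_rejected_lp   # policy log-ratio, chosen vs. rejected
    ref_ratio = ref_chosen_lp - ref_rejected_lp   # reference-model log-ratio
    return -F.logsigmoid(beta * (pol_ratio - ref_ratio)).mean()

# Toy usage: a batch of 4 preference pairs with random log-probabilities.
torch.manual_seed(0)
print(float(dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))))
```

In the iterative setting, the model trained in round $t$ regenerates the preference pairs for round $t+1$; the paper runs two such rounds starting from Meta-Llama-3-8B-Instruct.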
Wen Yang
School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Junhong Wu
PhD student, Institute of Automation, Chinese Academy of Sciences
Natural language processing; lifelong learning
Chen Wang
School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Chengqing Zong
School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Jiajun Zhang
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing; Large Language Models; Multimodal Information Processing