Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus

📅 2024-10-18
🏛️ arXiv.org
📈 Citations: 2
✨ Influential: 1
🤖 AI Summary
Multilingual large language models (LLMs) exhibit significant performance degradation on low-resource Indian languages such as Hindi. Method: This work combines continued pre-training with translation-synthesized bilingual corpora to build Nemotron-Mini-Hindi 4B, a lightweight bilingual model. It systematically validates, for the first time, the critical role of continued pre-training augmented with high-quality synthetic bilingual data in improving low-resource language capabilities. Training uses mixed Hindi-English token sequences over 400B tokens and demonstrates that instruction tuning alone cannot substitute for language-specific representation learning during pre-training. Results: The model achieves state-of-the-art performance on multiple Hindi benchmarks, with substantial gains in factual accuracy and conversational ability, while retaining competitive English task performance. Ablation studies confirm that Hindi-specific pre-training is the primary driver of improvement.
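For a concrete picture of the translation-based synthesis step, the sketch below builds synthetic Hindi text from English documents with an off-the-shelf MT model. The model name (facebook/nllb-200-distilled-600M) and language codes are illustrative assumptions, not the authors' actual pipeline, which is not specified at code level in this summary.

```python
# Minimal sketch: producing a translation-based synthetic Hindi corpus
# from English documents. The MT model and language codes below are
# illustrative assumptions, not the pipeline used in the paper.
from transformers import pipeline

# NLLB-200 uses FLORES-style language codes (eng_Latn, hin_Deva).
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="hin_Deva",
)

def synthesize_hindi(english_docs, max_length=512):
    """Translate English documents into Hindi to form synthetic
    pre-training text. A real pipeline would add quality filtering."""
    for doc in english_docs:
        out = translator(doc, max_length=max_length)
        yield out[0]["translation_text"]

if __name__ == "__main__":
    docs = ["Large language models often underperform on low-resource languages."]
    for hi_doc in synthesize_hindi(docs):
        print(hi_doc)
```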

📝 Abstract
Multilingual LLMs support a variety of languages; however, their performance is suboptimal for low-resource languages. In this work, we emphasize the importance of continued pre-training of multilingual LLMs and the use of translation-based synthetic pre-training corpora for improving LLMs in low-resource languages. We conduct our study in the context of the low-resource Indic language Hindi. We introduce Nemotron-Mini-Hindi 4B, a bilingual SLM supporting both Hindi and English, based on Nemotron-Mini 4B. The model is trained using a mix of real and synthetic Hindi + English tokens, with continuous pre-training performed on 400B tokens. We demonstrate that both the base and instruct models achieve state-of-the-art results on Hindi benchmarks while remaining competitive on English tasks. Additionally, we observe that the continued pre-training approach enhances the model's overall factual accuracy. We perform an ablation study to highlight the impact of Hindi pre-training, showing significant improvements in Hindi chat capabilities and factual accuracy, which cannot be achieved through Hindi alignment alone.
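The abstract states that training mixes real and synthetic Hindi + English tokens over 400B tokens of continuous pre-training, but the mixture ratios are not given here. Below is a minimal sketch of weighted sampling over such bilingual sources; the source names and weights are assumptions chosen purely for illustration.

```python
# Minimal sketch: sampling documents from real/synthetic Hindi and
# English sources under fixed mixture weights. The weights are
# illustrative; the paper's actual ratios are not stated in this summary.
import random

SOURCES = {
    "hindi_real":      0.25,
    "hindi_synthetic": 0.25,
    "english_real":    0.50,
}

def sample_stream(corpora, weights=SOURCES, seed=0):
    """Yield (source, document) pairs according to the mixture weights."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[n] for n in names]
    while True:
        name = rng.choices(names, probs)[0]
        yield name, rng.choice(corpora[name])

corpora = {
    "hindi_real": ["वास्तविक हिंदी दस्तावेज़"],
    "hindi_synthetic": ["अनुवादित (सिंथेटिक) हिंदी दस्तावेज़"],
    "english_real": ["A real English document."],
}

stream = sample_stream(corpora)
for _ in range(3):
    print(next(stream))
```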
Problem

Research questions and friction points this paper is trying to address.

Improving multilingual LLMs for low-resource languages like Hindi
Using continued pre-training and synthetic corpora for enhancement
Achieving state-of-the-art Hindi performance while maintaining English competitiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continued pre-training for low-resource languages
Translation-based synthetic corpora utilization
Bilingual model trained on a mix of real and synthetic data (see the training-step sketch after this list)
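As referenced above, here is a minimal sketch of a single continued pre-training step: resuming causal-LM training of an existing multilingual checkpoint on mixed Hindi/English batches. The checkpoint name is a placeholder, and the sketch omits the distributed training, data loading, and schedule machinery a real 400B-token run would require.

```python
# Minimal sketch of one continued pre-training step on bilingual text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the paper continues pre-training from
# Nemotron-Mini 4B. Substitute whatever base model you have access to.
CKPT = "nvidia/Nemotron-Mini-4B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(CKPT)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(CKPT)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def continued_pretraining_step(batch_texts):
    """One next-token-prediction step on a mixed Hindi/English batch."""
    enc = tokenizer(batch_texts, return_tensors="pt",
                    padding=True, truncation=True, max_length=512)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    loss = model(**enc, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Each batch interleaves real/synthetic Hindi with English text.
print(continued_pretraining_step(
    ["भारत की राजधानी नई दिल्ली है।", "The capital of India is New Delhi."]))
```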
Raviraj Joshi
Indian Institute of Technology Madras
computer science · machine learning · natural language processing
Kanishk Singla
NVIDIA
Anusha Kamath
NVIDIA
Raunak Kalani
NVIDIA
Rakesh Paul
Senior Deep Learning Scientist, NVIDIA
Multilingual NLP · LLM · Model Optimisation · LLM Safety
Utkarsh Vaidya
NVIDIA
Sanjay Singh Chauhan
NVIDIA
Niranjan Wartikar
NVIDIA
Eileen Long
NVIDIA