Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus

📅 2024-10-18
🏛️ arXiv.org
📈 Citations: 2
✨ Influential: 1
🤖 AI Summary
Multilingual large language models (LLMs) exhibit significant performance degradation on low-resource Indian languages such as Hindi. Method: This work combines continued pre-training with translation-synthesized bilingual corpora to build Nemotron-Mini-Hindi 4B, a lightweight bilingual model. It systematically validates, for the first time, the critical role of continued pre-training augmented with high-quality synthetic bilingual data in improving low-resource language capabilities. Training uses mixed Hindi-English token sequences over 400B tokens and demonstrates that instruction tuning alone cannot substitute for language-specific representation learning during pre-training. Results: The model achieves state-of-the-art performance on multiple Hindi benchmarks, with substantial gains in factual accuracy and conversational ability, while retaining competitive English task performance. Ablation studies confirm that Hindi-specific pre-training is the primary driver of improvement.
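For a concrete picture of the translation-based synthesis step, the sketch below builds synthetic Hindi text from English documents with an off-the-shelf MT model. The model name (facebook/nllb-200-distilled-600M) and language codes are illustrative assumptions, not the authors' actual pipeline, which is not specified at code level in this summary.

```python
# Minimal sketch: producing a translation-based synthetic Hindi corpus
# from English documents. The MT model and language codes below are
# illustrative assumptions, not the pipeline used in the paper.
from transformers import pipeline

# NLLB-200 uses FLORES-style language codes (eng_Latn, hin_Deva).
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="hin_Deva",
)

def synthesize_hindi(english_docs, max_length=512):
    """Translate English documents into Hindi to form synthetic
    pre-training text. A real pipeline would add quality filtering."""
    for doc in english_docs:
        out = translator(doc, max_length=max_length)
        yield out[0]["translation_text"]

if __name__ == "__main__":
    docs = ["Large language models often underperform on low-resource languages."]
    for hi_doc in synthesize_hindi(docs):
        print(hi_doc)
```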

📝 Abstract
Multilingual LLMs support a variety of languages; however, their performance is suboptimal for low-resource languages. In this work, we emphasize the importance of continued pre-training of multilingual LLMs and the use of translation-based synthetic pre-training corpora for improving LLMs in low-resource languages. We conduct our study in the context of the low-resource Indic language Hindi. We introduce Nemotron-Mini-Hindi 4B, a bilingual SLM supporting both Hindi and English, based on Nemotron-Mini 4B. The model is trained using a mix of real and synthetic Hindi + English tokens, with continuous pre-training performed on 400B tokens. We demonstrate that both the base and instruct models achieve state-of-the-art results on Hindi benchmarks while remaining competitive on English tasks. Additionally, we observe that the continued pre-training approach enhances the model's overall factual accuracy. We perform an ablation study to highlight the impact of Hindi pre-training, showing significant improvements in Hindi chat capabilities and factual accuracy, which cannot be achieved through Hindi alignment alone.
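The abstract states that training mixes real and synthetic Hindi + English tokens over 400B tokens of continuous pre-training, but the mixture ratios are not given here. Below is a minimal sketch of weighted sampling over such bilingual sources; the source names and weights are assumptions chosen purely for illustration.

```python
# Minimal sketch: sampling documents from real/synthetic Hindi and
# English sources under fixed mixture weights. The weights are
# illustrative; the paper's actual ratios are not stated in this summary.
import random

SOURCES = {
    "hindi_real":      0.25,
    "hindi_synthetic": 0.25,
    "english_real":    0.50,
}

def sample_stream(corpora, weights=SOURCES, seed=0):
    """Yield (source, document) pairs according to the mixture weights."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[n] for n in names]
    while True:
        name = rng.choices(names, probs)[0]
        yield name, rng.choice(corpora[name])

corpora = {
    "hindi_real": ["वास्तविक हिंदी दस्तावेज़"],
    "hindi_synthetic": ["अनुवादित (सिंथेटिक) हिंदी दस्तावेज़"],
    "english_real": ["A real English document."],
}

stream = sample_stream(corpora)
for _ in range(3):
    print(next(stream))
```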
Problem

Research questions and friction points this paper is trying to address.

Improving multilingual LLMs for low-resource languages like Hindi
Using continued pre-training and synthetic corpora for enhancement
Achieving state-of-the-art Hindi performance while maintaining English competitiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continued pre-training for low-resource languages
Translation-based synthetic corpora utilization
Bilingual model trained on a mix of real and synthetic data (see the training-step sketch after this list)
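As referenced above, here is a minimal sketch of a single continued pre-training step: resuming causal-LM training of an existing multilingual checkpoint on mixed Hindi/English batches. The checkpoint name is a placeholder, and the sketch omits the distributed training, data loading, and schedule machinery a real 400B-token run would require.

```python
# Minimal sketch of one continued pre-training step on bilingual text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the paper continues pre-training from
# Nemotron-Mini 4B. Substitute whatever base model you have access to.
CKPT = "nvidia/Nemotron-Mini-4B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(CKPT)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(CKPT)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def continued_pretraining_step(batch_texts):
    """One next-token-prediction step on a mixed Hindi/English batch."""
    enc = tokenizer(batch_texts, return_tensors="pt",
                    padding=True, truncation=True, max_length=512)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    loss = model(**enc, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Each batch interleaves real/synthetic Hindi with English text.
print(continued_pretraining_step(
    ["भारत की राजधानी नई दिल्ली है।", "The capital of India is New Delhi."]))
```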
Raviraj Joshi
Indian Institute of Technology Madras
computer science · machine learning · natural language processing
Kanishk Singla
NVIDIA
Anusha Kamath
NVIDIA
Raunak Kalani
NVIDIA
Rakesh Paul
Senior Deep Learning Scientist, NVIDIA
Multilingual NLP · LLM · Model Optimisation · LLM Safety
Utkarsh Vaidya
NVIDIA
Sanjay Singh Chauhan
NVIDIA
Niranjan Wartikar
NVIDIA
Eileen Long
NVIDIA