AI Summary
Multilingual large language models (LLMs) exhibit significant performance degradation on low-resource Indian languages such as Hindi. Method: This work combines continual pretraining with translation-synthesized bilingual corpora to build Nemotron-Mini-Hindi 4B, a lightweight bilingual model. It systematically validates, for the first time, the critical role of continual pretraining augmented with high-quality synthetic bilingual data in improving low-resource language capabilities. Training uses a bilingual mix of Hindi and English tokens, 400B tokens in total, and the study demonstrates that instruction tuning cannot substitute for language-specific foundational representation learning. Results: The model achieves state-of-the-art performance on multiple Hindi benchmarks, with substantial gains in factual accuracy and conversational ability, while retaining competitive English task performance. Ablation studies confirm that Hindi-specific pretraining is the primary driver of improvement.
Abstract
Multilingual LLMs support a variety of languages; however, their performance is suboptimal for low-resource languages. In this work, we emphasize the importance of continued pre-training of multilingual LLMs and the use of translation-based synthetic pre-training corpora for improving LLMs in low-resource languages. We conduct our study in the context of the low-resource Indic language Hindi. We introduce Nemotron-Mini-Hindi 4B, a bilingual small language model (SLM) supporting both Hindi and English, based on Nemotron-Mini 4B. The model is trained using a mix of real and synthetic Hindi + English tokens, with continued pre-training performed on 400B tokens. We demonstrate that both the base and instruct models achieve state-of-the-art results on Hindi benchmarks while remaining competitive on English tasks. Additionally, we observe that the continued pre-training approach enhances the model's overall factual accuracy. We perform an ablation study to highlight the impact of Hindi pre-training, showing significant improvements in Hindi chat capabilities and factual accuracy, which cannot be achieved through Hindi alignment alone.