UrduLLaMA 1.0: Dataset Curation, Preprocessing, and Evaluation in Low-Resource Settings

📅 2025-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the suboptimal performance of multilingual large language models (LLMs) on low-resource languages—specifically Urdu—this paper introduces UrduLLaMA 1.0, the first LLM designed exclusively for Urdu. Built on the Llama-3.1-8B-Instruct architecture, it combines continual pretraining with parameter-efficient fine-tuning (PEFT) under extremely limited data conditions: 128M Urdu tokens for continual pretraining, followed by LoRA-based fine-tuning on 41K Urdu instruction-response pairs and 50K English–Urdu parallel sentence pairs. This hybrid adaptation strategy substantially enhances Urdu language understanding, instruction following, and English–Urdu translation. On three major machine translation benchmarks, UrduLLaMA 1.0 achieves BLEU scores surpassing prior state-of-the-art (SOTA) methods by +4.2–6.8 points. The work establishes a new paradigm and benchmark for high-quality, small-scale LLM adaptation in low-resource settings.

📝 Abstract
Multilingual Large Language Models (LLMs) often provide suboptimal performance on low-resource languages like Urdu. This paper introduces UrduLLaMA 1.0, a model derived from the open-source Llama-3.1-8B-Instruct architecture and continually pre-trained on 128 million Urdu tokens, capturing the rich diversity of the language. To enhance instruction-following and translation capabilities, we leverage Low-Rank Adaptation (LoRA) to fine-tune the model on 41,000 Urdu instructions and approximately 50,000 English-Urdu translation pairs. Evaluation across three machine translation datasets demonstrates significant performance improvements compared to state-of-the-art (SOTA) models, establishing a new benchmark for Urdu LLMs. These findings underscore the potential of targeted adaptation strategies with limited data and computational resources to address the unique challenges of low-resource languages.
Problem

Research questions and friction points this paper is trying to address.

Improves Urdu language model performance
Enhances instruction-following and translation capabilities
Addresses challenges in low-resource language settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continual pre-training on 128M Urdu tokens
LoRA fine-tuning on Urdu instruction data
Enhanced translation via English-Urdu parallel pairs
Layba Fiaz
Center for Language Engineering, Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore
Munief Hassan Tahir
Center for Language Engineering, Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore
Sana Shams
Center for Language Engineering, Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore
Sarmad Hussain
Professor, Center for Language Engineering, KICS-UET
Computational Linguistics