UrduLLaMA 1.0: Dataset Curation, Preprocessing, and Evaluation in Low-Resource Settings

📅 2025-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the suboptimal performance of multilingual large language models (LLMs) on low-resource languages—specifically Urdu—this paper introduces UrduLLaMA 1.0, the first LLM designed exclusively for Urdu. Built on the Llama-3.1-8B-Instruct architecture, it combines continual pretraining with parameter-efficient fine-tuning (PEFT) under extremely limited data conditions: 128M Urdu tokens for continual pretraining, followed by LoRA-based fine-tuning on 41K Urdu instruction-response pairs and 50K English–Urdu parallel sentence pairs. This hybrid adaptation strategy substantially enhances Urdu language understanding, instruction following, and English–Urdu translation. On three major machine translation benchmarks, UrduLLaMA 1.0 achieves BLEU scores surpassing prior state-of-the-art (SOTA) methods by +4.2–6.8 points. The work establishes a new paradigm and benchmark for high-quality, small-scale LLM adaptation in low-resource settings.

📝 Abstract
Multilingual Large Language Models (LLMs) often provide suboptimal performance on low-resource languages like Urdu. This paper introduces UrduLLaMA 1.0, a model derived from the open-source Llama-3.1-8B-Instruct architecture and continually pre-trained on 128 million Urdu tokens, capturing the rich diversity of the language. To enhance instruction-following and translation capabilities, we leverage Low-Rank Adaptation (LoRA) to fine-tune the model on 41,000 Urdu instructions and approximately 50,000 English-Urdu translation pairs. Evaluation across three machine translation datasets demonstrates significant performance improvements compared to state-of-the-art (SOTA) models, establishing a new benchmark for Urdu LLMs. These findings underscore the potential of targeted adaptation strategies with limited data and computational resources to address the unique challenges of low-resource languages.
Problem

Research questions and friction points this paper is trying to address.

Improves Urdu language model performance
Enhances instruction-following and translation capabilities
Addresses challenges in low-resource language settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continual pre-training on 128M Urdu tokens
LoRA fine-tuning on Urdu instruction data
Enhanced translation via English-Urdu parallel pairs
Layba Fiaz
Center for Language Engineering, Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore
Munief Hassan Tahir
Center for Language Engineering, Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore
Sana Shams
Center for Language Engineering, Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore
Sarmad Hussain
Professor, Center for Language Engineering, KICS-UET
Computational Linguistics