InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the poor instruction-following performance of large language models (LLMs) on low-resource languages, which stems from scarce, high-quality instruction data, this paper proposes InstructLR—a novel framework featuring a two-tier quality control mechanism: (1) retrieval-augmented n-shot automated filtering and (2) human validation—designed to jointly ensure fluency, orthographic consistency, and task-structure validity (inspired by benchmarks such as MMLU). By coupling LLM-driven generation with RAG-enhanced prompt filtering, InstructLR enables scalable, high-fidelity instruction dataset construction. The authors curate three multi-domain instruction datasets—ZarmaInstruct-50k, BambaraInstruct-50k, and FulfuldeInstruct-50k—each containing 50,000 samples. Experiments demonstrate substantial improvements in instruction adherence and text generation across multiple low-resource languages. This work establishes both a reusable methodology and publicly available benchmark resources for adapting LLMs to linguistically under-resourced settings.

📝 Abstract
Effective text generation and chat interfaces for low-resource languages (LRLs) remain challenging for state-of-the-art large language models (LLMs) to support. This is mainly due to the difficulty of curating high-quality instruction datasets for LRLs, a limitation prevalent in the languages spoken across the African continent and other regions. Current approaches, such as automated translation and synthetic data generation, frequently yield outputs that lack fluency or even orthographic consistency. In this paper, we introduce InstructLR, a novel framework designed to generate high-quality instruction datasets for LRLs. Our approach integrates LLM-driven text generation with a dual-layer quality filtering mechanism: an automated filtering layer based on retrieval-augmented generation (RAG) n-shot prompting, and a human-in-the-loop validation layer. Drawing inspiration from benchmarks such as MMLU for task definition, InstructLR has facilitated the creation of three multi-domain instruction benchmarks: ZarmaInstruct-50k, BambaraInstruct-50k, and FulfuldeInstruct-50k.
Problem

Research questions and friction points this paper is trying to address.

Scarcity of high-quality instruction datasets for low-resource languages
Lack of fluency and orthographic consistency in automated translation and synthetic generation outputs
Absence of scalable benchmarks for under-resourced African languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-driven generation with dual-layer quality filtering
Automated filtering using RAG-based n-shot prompting
Human-in-the-loop validation for dataset creation
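The dual-layer pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the word-overlap retrieval, the `judge` callable standing in for an LLM scorer, and the 0.5 acceptance threshold are all assumptions made for clarity.

```python
def retrieve_exemplars(candidate, pool, n=3):
    """Toy retrieval step: rank pool items by word overlap with the candidate.
    A real system would use dense or lexical retrieval over a curated corpus."""
    cand_words = set(candidate.lower().split())
    ranked = sorted(
        pool,
        key=lambda ex: len(cand_words & set(ex.lower().split())),
        reverse=True,
    )
    return ranked[:n]

def automated_filter(candidate, pool, judge, n=3, threshold=0.5):
    """Layer 1: build a RAG-based n-shot prompt and score it with an
    LLM judge (here an injected callable returning a score in [0, 1])."""
    shots = retrieve_exemplars(candidate, pool, n)
    prompt = "\n".join(f"Example: {s}" for s in shots) + f"\nRate: {candidate}"
    return judge(prompt) >= threshold

def human_validate(candidate, reviewer):
    """Layer 2: human-in-the-loop accept/reject decision."""
    return reviewer(candidate)

def build_dataset(candidates, pool, judge, reviewer):
    """Keep only samples that pass both filtering layers, in order."""
    kept = [c for c in candidates if automated_filter(c, pool, judge)]
    return [c for c in kept if human_validate(c, reviewer)]
```

In practice the `judge` would be an LLM prompted with retrieved in-language exemplars, and `reviewer` a native-speaker annotation interface; the two layers together filter for fluency, orthographic consistency, and task-structure validity.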
Mamadou K. Keita
Rochester Institute of Technology
Sébastien Diarra
RobotsMali
Christopher Homan
Rochester Institute of Technology
Computer Science
Seydou Diallo
MALIBA-AI