🤖 AI Summary
To address the poor instruction-following performance of large language models (LLMs) on low-resource languages, caused by scarce, high-quality instruction data, this paper proposes InstructLR, a novel framework built around a two-tier quality control mechanism: (1) automated filtering via retrieval-augmented n-shot prompting and (2) human validation. Together these layers ensure fluency, orthographic consistency, and task-structure validity, with task definitions inspired by benchmarks such as MMLU. By coupling LLM-driven generation with RAG-enhanced prompt filtering, InstructLR enables scalable, high-fidelity instruction dataset construction. We curate three multi-domain instruction datasets (ZarmaInstruct-50k, BambaraInstruct-50k, and FulfuldeInstruct-50k), each containing 50,000 samples. Experiments demonstrate substantial improvements in instruction adherence and text generation across multiple low-resource languages. This work establishes both a reusable methodology and publicly available benchmark resources for adapting LLMs to linguistically under-resourced settings.
📝 Abstract
State-of-the-art large language models (LLMs) still struggle to support effective text generation and chat interfaces for low-resource languages (LRLs). This is mainly due to the difficulty of curating high-quality instruction datasets for LRLs, a limitation prevalent across the languages of the African continent and other regions. Current approaches, such as automated translation and synthetic data generation, frequently yield outputs that lack fluency or even orthographic consistency. In this paper, we introduce InstructLR, a novel framework for generating high-quality instruction datasets for LRLs. Our approach integrates LLM-driven text generation with a dual-layer quality filtering mechanism: an automated layer based on retrieval-augmented generation (RAG) with n-shot prompting, and a human-in-the-loop validation layer. Drawing inspiration from benchmarks such as MMLU for task definition, InstructLR has facilitated the creation of three multi-domain instruction benchmarks: ZarmaInstruct-50k, BambaraInstruct-50k, and FulfuldeInstruct-50k.
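The dual-layer pipeline above can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: the real system uses an LLM judge anchored by RAG-retrieved n-shot exemplars, which is stubbed here with a toy token-overlap similarity; all names (`Sample`, `retrieve_n_shot`, `auto_filter`, `pipeline`) are invented for the sketch.

```python
# Minimal sketch of a two-tier quality control pipeline, assuming:
# tier 1 = automated filtering with retrieval-augmented n-shot context,
# tier 2 = a human validation queue. The LLM judge is replaced by a
# toy overlap heuristic purely for illustration.
from dataclasses import dataclass


@dataclass
class Sample:
    instruction: str
    response: str


def similarity(a: str, b: str) -> float:
    """Toy token-overlap (Jaccard) similarity, standing in for a real retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)


def retrieve_n_shot(candidate: Sample, corpus: list[Sample], n: int = 3) -> list[Sample]:
    """Tier 1a: retrieve the n vetted examples most similar to the candidate,
    which would anchor the judge's n-shot prompt in the full system."""
    ranked = sorted(corpus, key=lambda s: similarity(candidate.instruction, s.instruction), reverse=True)
    return ranked[:n]


def auto_filter(candidate: Sample, corpus: list[Sample], threshold: float = 0.2) -> bool:
    """Tier 1b: stub for the LLM judge. Here a candidate passes if it
    resembles at least one retrieved exemplar above the threshold."""
    shots = retrieve_n_shot(candidate, corpus)
    return any(similarity(candidate.instruction, s.instruction) >= threshold for s in shots)


def pipeline(candidates: list[Sample], corpus: list[Sample]) -> list[Sample]:
    """Route samples that survive automated filtering to the tier-2
    human-validation queue (returned as a list here)."""
    return [c for c in candidates if auto_filter(c, corpus)]
```

In the full framework the `auto_filter` step would prompt an LLM with the retrieved exemplars and the candidate, and only its survivors would reach human validators, keeping the expensive manual layer small.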