🤖 AI Summary
This study addresses persistent challenges in NLP for three low-resource Nigerian languages—Hausa, Yoruba, and Igbo—including severe data scarcity, incomplete task coverage, and inadequate modeling of language-specific features (e.g., diacritic representation). It presents the first systematic, multi-language survey of these languages. Through quantitative resource analysis, meta-evaluation of existing literature, and downstream task mapping, the work comprehensively characterizes the current state: only 25.1% of studies introduce novel resources; annotation quality is inconsistent; and language-specific phenomena remain largely unmodeled. The study produces a fine-grained resource gap map and proposes a tripartite development framework—“resource enrichment, collaborative annotation, and open co-construction.” Additionally, it introduces a standardized, reusable NLP evaluation framework specifically designed for African low-resource languages, enabling reproducible benchmarking and cross-lingual comparison.
📝 Abstract
With over 500 languages in Nigeria, three languages -- Hausa, Yorùbá and Igbo -- spoken by over 175 million people, account for about 60% of the spoken languages. However, these languages are categorised as low-resource due to insufficient resources to support tasks in computational linguistics. Several research efforts and initiatives have been presented; however, a coherent understanding of the state of Natural Language Processing (NLP), from grammatical formalisation to linguistic resources that support complex tasks such as language understanding and generation, is lacking. This study presents the first comprehensive review of advancements in low-resource NLP (LR-NLP) research across the three major Nigerian languages (NaijaNLP). We quantitatively assess the available linguistic resources and identify key challenges. Although a growing body of literature addresses various NLP downstream tasks in Hausa, Igbo, and Yorùbá, only about 25.1% of the reviewed studies contribute new linguistic resources. This finding highlights a persistent reliance on repurposing existing data rather than generating novel, high-quality resources. Additionally, language-specific challenges, such as the accurate representation of diacritics, remain under-explored. To advance NaijaNLP and LR-NLP more broadly, we emphasise the need for intensified efforts in resource enrichment, comprehensive annotation, and the development of open collaborative initiatives.