NaijaNLP: A Survey of Nigerian Low-Resource Languages

📅 2025-02-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses persistent challenges in NLP for three low-resource Nigerian languages—Hausa, Yoruba, and Igbo—including severe data scarcity, incomplete task coverage, and inadequate modeling of language-specific features (e.g., diacritic representation). It presents the first systematic, multi-language survey of these languages. Through quantitative resource analysis, meta-evaluation of existing literature, and downstream task mapping, the work comprehensively characterizes the current state: only 25.1% of studies introduce novel resources; annotation quality is inconsistent; and language-specific phenomena remain largely unmodeled. The study produces a fine-grained resource gap map and proposes a tripartite development framework—“resource enrichment, collaborative annotation, and open co-construction.” Additionally, it introduces a standardized, reusable NLP evaluation framework specifically designed for African low-resource languages, enabling reproducible benchmarking and cross-lingual comparison.

📝 Abstract

With over 500 languages in Nigeria, three languages (Hausa, Yorùbá and Igbo), spoken by over 175 million people, account for about 60% of the spoken languages. However, these languages are categorised as low-resource due to insufficient resources to support tasks in computational linguistics. Several research efforts and initiatives have been presented; however, a coherent understanding of the state of Natural Language Processing (NLP), from grammatical formalisation to linguistic resources that support complex tasks such as language understanding and generation, is lacking. This study presents the first comprehensive review of advancements in low-resource NLP (LR-NLP) research across the three major Nigerian languages (NaijaNLP). We quantitatively assess the available linguistic resources and identify key challenges. Although a growing body of literature addresses various NLP downstream tasks in Hausa, Igbo, and Yorùbá, only about 25.1% of the reviewed studies contribute new linguistic resources. This finding highlights a persistent reliance on repurposing existing data rather than generating novel, high-quality resources. Additionally, language-specific challenges, such as the accurate representation of diacritics, remain under-explored. To advance NaijaNLP and LR-NLP more broadly, we emphasise the need for intensified efforts in resource enrichment, comprehensive annotation, and the development of open collaborative initiatives.
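
To make the diacritic-representation challenge mentioned in the abstract concrete, here is a minimal Python sketch. It is not taken from the paper. It uses only the standard-library `unicodedata` module, and the helper name `strip_diacritics` is a hypothetical one introduced for illustration. The sketch shows that the same word can be stored in two Unicode forms that look identical but compare as different strings, and that naive diacritic stripping discards the tone marks entirely.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Hypothetical helper: drop all combining marks (lossy for tonal languages)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# "Yorùbá" written with precomposed characters (escapes avoid encoding issues).
word = "Yor\u00f9b\u00e1"

# NFC keeps each accented letter as one code point; NFD splits it into a
# base letter followed by a combining mark.
composed = unicodedata.normalize("NFC", word)
decomposed = unicodedata.normalize("NFD", word)

print(len(composed), len(decomposed))  # 6 8
print(composed == decomposed)          # False: same word, different code points
print(strip_diacritics(word))          # Yoruba
```

Under these assumptions, a corpus that mixes both forms would count one word as two distinct tokens unless it is normalized to a single form first. Stripping the marks avoids the mismatch but also removes the tonal information the marks carry.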

Problem

Research questions and friction points this paper is trying to address.

Addressing low-resource NLP in Nigerian languages.

Reviewing advancements in Hausa, Igbo, Yoruba NLP.

Identifying challenges in linguistic resource development.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-resource NLP review

Quantitative linguistic assessment

Resource enrichment emphasis

🔎 Similar Papers

Does Generative AI speak Nigerian-Pidgin?: Issues about Representativeness and Bias for Multilingualism in LLMs