The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages

📅 2025-05-26

🤖 AI Summary

African languages, and in particular Nigeria's three major languages (Igbo, Hausa, and Yoruba), have long suffered from a severe scarcity of high-quality speech data, leaving speech technologies underdeveloped for roughly one billion people. To address this, we introduce NaijaVoices, a large-scale, culturally rich speech dataset: 1,800 hours of audio from 5,000+ speakers, spanning diverse geographic regions, age groups, and accents, accompanied by quality control and annotation protocols. We describe a crowdsourcing approach designed to balance cultural representativeness, acoustic diversity, and scalability. Fine-tuning ASR models on the dataset yields average WER improvements of 75.86% (Whisper), 52.06% (MMS), and 42.33% (XLSR). These results point to the dataset's potential to advance multilingual speech processing for African languages.

📝 Abstract

The development of high-performing, robust, and reliable speech technologies depends on large, high-quality datasets. However, African languages, including our focus languages Igbo, Hausa, and Yoruba, remain under-represented due to insufficient data. Popular voice-enabled technologies do not support any of the 2,000+ African languages, limiting accessibility for roughly one billion people. While previous dataset efforts exist for the target languages, they lack the scale and diversity needed for robust speech models. To bridge this gap, we introduce the NaijaVoices dataset, a 1,800-hour speech-text dataset with 5,000+ speakers. We outline our data collection approach, analyze the dataset's acoustic diversity, and demonstrate its impact through fine-tuning experiments on automatic speech recognition, achieving average WER improvements of 75.86% (Whisper), 52.06% (MMS), and 42.33% (XLSR). These results highlight NaijaVoices' potential to advance multilingual speech processing for African languages.
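For readers unfamiliar with the metric, the following is a minimal sketch of how word error rate (WER) and a percentage improvement between two models could be computed. It assumes whitespace tokenization and assumes the reported improvements are relative reductions against the baseline model; the abstract does not state either, and the function names `word_error_rate` and `relative_wer_reduction` are hypothetical, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumption: words are separated by whitespace; no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, finetuned_wer: float) -> float:
    """Percentage by which WER drops relative to the baseline.

    Assumption: "improvement" means (baseline - finetuned) / baseline.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - finetuned_wer) / baseline_wer
```

Under this reading, a model whose WER falls from 0.50 to 0.25 after fine-tuning shows a 50% relative improvement, even though the absolute drop is 25 percentage points.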

Problem

Research questions and friction points this paper is trying to address.

Lack of large, high-quality datasets for African languages

Under-representation of Igbo, Hausa, and Yoruba in speech technologies

Insufficient scale and diversity in existing African language datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale 1,800-hour African speech dataset

Diverse data collected from 5,000+ speakers

Average WER improvements of 42-76% after fine-tuning Whisper, MMS, and XLSR
