The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages

📅 2025-05-26
🤖 AI Summary
African languages, particularly Nigeria’s three major languages (Igbo, Hausa, and Yoruba), have long suffered from a severe scarcity of high-quality speech data, resulting in critically underdeveloped speech technologies for about one billion people. To address this, we introduce the largest culturally adapted speech dataset to date: 1,800 hours of audio from 5,000+ speakers, spanning diverse geographic regions, age groups, and accents, accompanied by rigorous quality control and annotation protocols. We propose a novel crowdsourcing paradigm explicitly designed to balance cultural representativeness, acoustic diversity, and scalability. Fine-tuning state-of-the-art ASR models (Whisper, MMS, and XLSR) on our dataset yields average WER reductions of 75.86%, 52.06%, and 42.33%, respectively. This work marks the first systematic advancement in both accuracy and cross-dialect generalization for African language ASR.

📝 Abstract
The development of high-performing, robust, and reliable speech technologies depends on large, high-quality datasets. However, African languages -- including our focus, Igbo, Hausa, and Yoruba -- remain under-represented due to insufficient data. Popular voice-enabled technologies do not support any of the 2000+ African languages, limiting accessibility for circa one billion people. While previous dataset efforts exist for the target languages, they lack the scale and diversity needed for robust speech models. To bridge this gap, we introduce the NaijaVoices dataset, a 1,800-hour speech-text dataset with 5,000+ speakers. We outline our unique data collection approach, analyze its acoustic diversity, and demonstrate its impact through finetuning experiments on automatic speech recognition, achieving average word error rate (WER) improvements of 75.86% (Whisper), 52.06% (MMS), and 42.33% (XLSR). These results highlight NaijaVoices' potential to advance multilingual speech processing for African languages.
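For readers unfamiliar with the metric, the following is a minimal sketch of how a WER figure and a percentage improvement like those quoted in the abstract could be computed. It is not the authors' evaluation code. The abstract does not say whether the reported improvements are relative or absolute, so the sketch assumes a relative reduction; the function names `word_error_rate` and `relative_wer_reduction` and the English placeholder sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, finetuned_wer: float) -> float:
    """Percentage by which WER drops relative to the baseline (assumed definition)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - finetuned_wer) / baseline_wer


if __name__ == "__main__":
    # Placeholder sentences for illustration only, not drawn from the dataset.
    reference = "the cat sat on the mat"
    baseline_output = "the cat sit on mat"
    finetuned_output = "the cat sat on the mat"

    baseline = word_error_rate(reference, baseline_output)
    finetuned = word_error_rate(reference, finetuned_output)
    print(f"baseline WER:  {baseline:.3f}")
    print(f"finetuned WER: {finetuned:.3f}")
    print(f"relative reduction: {relative_wer_reduction(baseline, finetuned):.2f}%")
```

Under this assumed definition, a model whose WER falls from 0.80 to 0.20 shows a 75% relative reduction, which is the kind of figure the abstract reports per model. If the paper instead reports absolute differences, the second function would simply return `baseline_wer - finetuned_wer`.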
Problem

Research questions and friction points this paper is trying to address.

Lack of large, high-quality datasets for African languages
Under-representation of Igbo, Hausa, and Yoruba in speech technologies
Insufficient scale and diversity in existing African language datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale 1,800-hour African speech dataset
Diverse speech data collected from 5,000+ speakers
Average ASR WER improvements of 42.33% to 75.86% after finetuning Whisper, MMS, and XLSR
Chris Emezue
Mila - Quebec AI Institute, University of Montreal, Canada
The NaijaVoices Community
NaijaVoices
Busayo Awobade
Research Scientist, MLCollective
Speech processing, Multilinguality
Abraham Owodunni
The Ohio State University
Multilingual NLP, Low-resource NLP, Efficient ML
Handel Emezue
NaijaVoices, Alex Ekwueme Federal University Ndufu Alike Ikwo, Nigeria
Gloria Monica Tobechukwu Emezue
NaijaVoices, Alex Ekwueme Federal University Ndufu Alike Ikwo, Nigeria
Nefertiti Nneoma Emezue
NaijaVoices, Alex Ekwueme Federal University Ndufu Alike Ikwo, Nigeria
Sewade Ogun
INRIA, France
Bunmi Akinremi
Obafemi Awolowo University, Nigeria
David Ifeoluwa Adelani
McGill University and Mila - Quebec AI Institute and Canada CIFAR AI Chair
Natural language processing, Multilinguality, Multilingual NLP, AfricaNLP, Low-resource NLP