🤖 AI Summary
A lack of publicly available, Africa-centric automatic speech recognition (ASR) evaluation benchmarks hinders fair, cross-accent and cross-national model assessment and deployment. Method: We introduce the first multi-domain, multi-national ASR benchmark covering spoken English from over 10 African countries, encompassing 100+ regional accents and seven application domains. Our vertically structured evaluation framework integrates spontaneous and read speech, open- and closed-source ASR systems, and multimodal large language models (LLMs), employing fine-grained error analysis and joint latency–accuracy evaluation. Contribution/Results: Experiments reveal that open-source ASR models excel on spontaneous speech but suffer from poor noise robustness; multimodal LLMs exhibit strong accent robustness yet frequently misrecognize proper nouns; fine-tuned models balance accuracy and low latency but still generate pervasive hallucinations. This work establishes a novel paradigm for low-resource accent adaptation and provides a foundational resource for equitable ASR development in linguistically diverse African contexts.
📝 Abstract
Recent advances in speech-enabled AI, including Google's NotebookLM and OpenAI's speech-to-speech API, are driving widespread interest in voice interfaces globally. Despite this momentum, no publicly available, application-specific model evaluation caters to Africa's linguistic diversity. We present AfriSpeech-MultiBench, the first domain-specific evaluation suite for over 100 African English accents across 10+ countries and seven application domains: Finance, Legal, Medical, General dialogue, Call Center, Named Entities, and Hallucination Robustness. We benchmark a diverse range of open, closed, unimodal ASR and multimodal LLM-based speech recognition systems using both spontaneous and non-spontaneous conversational speech drawn from various open African-accented English speech datasets. Our empirical analysis reveals systematic variation: open-source ASR models excel in spontaneous speech contexts but degrade on noisy, non-native dialogue; multimodal LLMs are more accent-robust yet struggle with domain-specific named entities; proprietary models deliver high accuracy on clean speech but vary significantly by country and domain. Models fine-tuned on African English achieve competitive accuracy with lower latency, a practical advantage for deployment, though hallucinations remain a significant problem for most SOTA models. By releasing this comprehensive benchmark, we empower practitioners and researchers to select voice technologies suited to African use cases, fostering inclusive voice applications for underserved communities.
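The joint latency–accuracy evaluation described above can be sketched in a few lines. The word error rate (WER) computation below is the standard dynamic-programming edit distance; the `model` callable and `timed_transcribe` helper are illustrative assumptions, not the benchmark's actual API.

```python
import time


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / #reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution or match
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


def timed_transcribe(model, audio):
    """Return (hypothesis, latency in seconds) for one utterance.

    `model` stands in for any ASR system: a local checkpoint,
    a multimodal LLM call, or a proprietary API wrapper.
    """
    start = time.perf_counter()
    hyp = model(audio)
    return hyp, time.perf_counter() - start
```

Averaging `wer` and the measured latency over a test split gives one (accuracy, latency) point per model, which is how the deployment trade-off noted in the abstract, accuracy versus responsiveness, can be compared across systems.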