🤖 AI Summary
This study evaluates and aims to improve large language model (LLM) capabilities for low-resource African languages. We introduce IrokoBench, the first typologically diverse, human-translated, multi-task benchmark covering 17 African languages, spanning natural language inference, mathematical reasoning, and multiple-choice knowledge-based question answering. Using IrokoBench, we systematically assess zero-shot, few-shot, and translate-test performance across 10 open-source and 6 proprietary LLMs. Our analysis quantifies a substantial gap between LLM performance on African languages and on high-resource languages such as English and French, as well as between open and proprietary models: the best-performing open model, Gemma 2 27B, reaches only 63% of GPT-4o's performance. Crucially, machine-translating test sets into English significantly boosts the performance of strong English-centric models, including Gemma 2 27B and LLaMA 3.1 70B, on African languages.
📝 Abstract
Despite the widespread adoption of large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (e.g., African languages) are often evaluated only on basic text classification tasks due to the lack of appropriate or comprehensive benchmarks outside of high-resource languages. In this paper, we introduce IrokoBench, a human-translated benchmark dataset for 17 typologically diverse low-resource African languages covering three tasks: natural language inference (AfriXNLI), mathematical reasoning (AfriMGSM), and multiple-choice knowledge-based question answering (AfriMMLU). We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across 10 open and 6 proprietary LLMs. Our evaluation reveals a significant performance gap between high-resource languages (such as English and French) and low-resource African languages. We also observe a significant performance gap between open and proprietary models, with the best-performing open model, Gemma 2 27B, reaching only 63% of the performance of the best proprietary model, GPT-4o. In addition, machine-translating the test set into English before evaluation helped to close the gap for larger English-centric models, such as Gemma 2 27B and LLaMA 3.1 70B. These findings suggest that more effort is needed to develop and adapt LLMs for African languages.
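The translate-test setting mentioned above is simple to sketch: machine-translate each test input into English, then evaluate the (English-centric) model on the translations instead of the original text. The following is a minimal illustrative sketch, not the paper's actual pipeline; the translator, model, and data here are all hypothetical stand-ins.

```python
def translate_test_eval(examples, translate, model):
    """Translate-test evaluation: translate each test input to English,
    then score the model's predictions on the translated inputs."""
    correct = 0
    for ex in examples:
        english_input = translate(ex["text"], target_lang="en")
        prediction = model(english_input)
        correct += int(prediction == ex["label"])
    return correct / len(examples)

# Toy demo with stand-in translator and model (both hypothetical):
examples = [
    {"text": "Bawo ni?", "label": "greeting"},   # Yoruba: "How are you?"
    {"text": "E se",     "label": "thanks"},     # Yoruba: "Thank you"
]
lexicon = {"Bawo ni?": "How are you?", "E se": "Thank you"}
translate = lambda text, target_lang: lexicon[text]
model = lambda text: "greeting" if "How" in text else "thanks"
print(translate_test_eval(examples, translate, model))  # 1.0
```

In practice the translator would be a machine-translation system rather than a lookup table; the point is only that the model is never shown the source-language text.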