AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages

📅 2025-01-14
🤖 AI Summary
Hate speech detection for African languages faces two challenges: cultural misinterpretation and data scarcity. To address them, we introduce the first high-quality, culturally grounded dataset covering 15 African languages, annotated exclusively by native speakers within local sociocultural contexts, with sustained community involvement in both annotation and lexicon development. We propose a fine-grained, culture-sensitive annotation framework and publicly release open-source hate speech and offensive language lexicons, individual annotations, and benchmark classification models, including both traditional machine learning and LLM fine-tuning approaches. Experimental results show that our LLM-augmented methods significantly outperform zero-shot baselines across multiple cross-lingual hate speech classification tasks. This work addresses critical gaps in content moderation for low-resource languages of the Global South, in terms of both culturally representative data and community-informed modeling.

📝 Abstract
Hate speech and abusive language are global phenomena that need socio-cultural background knowledge to be understood, identified, and moderated. However, in many regions of the Global South, there have been several documented occurrences of (1) absence of moderation and (2) censorship due to the reliance on keyword spotting out of context. Further, high-profile individuals have frequently been at the center of the moderation process, while large and targeted hate speech campaigns against minorities have been overlooked. These limitations are mainly due to the lack of high-quality data in the local languages and the failure to include local communities in the collection, annotation, and moderation processes. To address this issue, we present AfriHate: a multilingual collection of hate speech and abusive language datasets in 15 African languages. Each instance in AfriHate is annotated by native speakers familiar with the local culture. We report the challenges related to the construction of the datasets and present various classification baseline results with and without using LLMs. The datasets, individual annotations, and hate speech and offensive language lexicons are available at https://github.com/AfriHate/AfriHate
Problem

Research questions and friction points this paper is trying to address.

Hate Speech
African Languages
Data Deficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

AfriHate Multilingual Database
Culturally-sensitive Hate Speech Analysis
Advanced Data Classification Techniques
Shamsuddeen Hassan Muhammad
Bayero University, Kano, & Google DeepMind Academic Fellow at Imperial College London
Natural Language Processing, Sentiment Analysis, AfricaNLP, Low-resource NLP, Multilinguality
Idris Abdulmumin
Postdoctoral Fellow, DSFSI, University of Pretoria
Machine Translation, Neural Machine Translation, Natural Language Processing, Internet Technology
Abinew Ali Ayele
University of Hamburg
Computational Social Science, Natural Language Processing, Low-resource Languages, AI
David Ifeoluwa Adelani
McGill University and Mila - Quebec AI Institute and Canada CIFAR AI Chair
Natural language processing, Multilinguality, Multilingual NLP, AfricaNLP, Low-resource NLP
Ibrahim Said Ahmad
Northeastern University
Natural Language Processing, Big Data, Data mining, Artificial Intelligence
Saminu Mohammad Aliyu
Bayero University Kano
Nelson Odhiambo Onyango
Maseno University
Lilian D. A. Wanzare
Maseno University
Samuel Rutunda
Digital Umuganda
Lukman Jibril Aliyu
Zipline, Nigeria
Natural Language Processing, Biomedical Informatics, Health Systems Improvement
Esubalew Alemneh
Haramaya University
Oumaima Hourrane
Al Akhawayn University
Hagos Tesfahun Gebremichael
Bahir Dar University
Elyas Abdi Ismail
Haramaya University
Meriem Beloucif
Uppsala University
Ebrahim Chekol Jibril
Istanbul Technical University
Andiswa Bukula
SADiLaR
Rooweither Mabuya
SADiLaR
Salomey Osei
University of Deusto
Machine Learning, NLP, Auto ML
Abigail Oppong
Independent Researcher
Tadesse Destaw Belay
Ph.D. candidate IPN, Mexico
NLP for Low-resource languages, Machine learning, and LLMs
Tadesse Kebede Guge
Addis Ababa University
Tesfa Tegegne Asfaw
Bahir Dar University
Chiamaka Ijeoma Chukwuneke
Lancaster University
Paul Röttger
Postdoctoral Researcher, Bocconi University
Large Language Models, Safety and Societal Impacts of AI Systems
Seid Muhie Yimam
University of Hamburg
Nedjma Ousidhoum
Lecturer (Assistant Professor), Cardiff University
Natural Language Processing, Computational Social Science, Machine Learning