BOUTEF: A Multilingual Corpus for FakeNews in North Africa -- Language as a Weapon

📅 2026-05-29

🤖 AI Summary

This study addresses the rapid spread of disinformation on social media in North Africa’s multilingual, low-resource settings by constructing BOUTEF, the first multilingual fake news corpus covering Algeria and Tunisia. BOUTEF combines fake and genuine narratives, user comments, and fact-checking rebuttals, and it systematically covers local linguistic varieties, including Algerian and Tunisian Arabic dialects, Arabizi, and code-switched text. Using multilingual data collection and annotation, topic modeling, sentiment analysis, and a cross-country comparison, the authors find that disinformation relies on emotionally charged narratives and hybrid language strategies to increase virality, whereas fact-checking content mostly adopts an evidence-based, verification-oriented style. User engagement correlates strongly with the visibility of false content. The work highlights the “weaponization” of linguistic practices in disinformation campaigns and publicly releases an annotated dataset, filling a research gap for the North African region.

📝 Abstract

The rapid spread of fake news on social media has become a major challenge, particularly in multilingual and under-resourced contexts such as North Africa. In this paper, we introduce BOUTEF, a large-scale multilingual corpus designed to study the propagation, characteristics, and impact of fake news in Algeria and Tunisia. The corpus integrates three complementary components: fake narratives, genuine narratives, and associated user-generated comments, along with verified debunking information. It covers a wide range of languages and linguistic varieties, including Modern Standard Arabic (MSA), Algerian and Tunisian dialects, Arabizi, French, English, and code-switched language. Building on this resource, we conduct a comprehensive empirical analysis combining quantitative and qualitative approaches. We examine thematic distributions, linguistic and rhetorical strategies, sentiment patterns, and social engagement dynamics. Statistical analyses reveal significant associations between thematic categories and message veracity, as well as strong correlations between user engagement and the visibility of fake content. Our findings show that fake news relies heavily on emotionally charged narratives, sensational framing, and hybrid linguistic practices that enhance virality and audience engagement. In contrast, debunking content adopts a more factual and verification-oriented style. Furthermore, a comparative analysis between Algeria and Tunisia highlights both shared dynamics and country-specific characteristics shaped by sociopolitical contexts. The results emphasize the role of informal language practices in the diffusion and reception of misinformation. By providing a rich, annotated, and publicly available dataset, this work contributes to advancing research on fake news detection, low-resource language processing, and the understanding of information disorders in complex linguistic environments.
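
The abstract reports significant associations between thematic categories and message veracity but does not name the test used. One common choice for this kind of question is a chi-square test of independence on a topic-by-veracity contingency table. The sketch below is a minimal illustration under that assumption, not the authors' procedure. The function name `chi_square_independence`, the topic labels, and all counts are hypothetical toy values, not figures from BOUTEF.

```python
from typing import Dict, Tuple


def chi_square_independence(table: Dict[str, Dict[str, int]]) -> Tuple[float, int]:
    """Return (chi-square statistic, degrees of freedom) for a contingency table.

    `table` maps a row label (here: a topic) to a mapping from column label
    (here: a veracity class) to an observed count. Every row and column
    total must be positive, otherwise an expected count would be zero.
    """
    rows = list(table)
    cols = sorted({c for r in rows for c in table[r]})

    # Marginal totals for rows, columns, and the whole table.
    row_tot = {r: sum(table[r].get(c, 0) for c in cols) for r in rows}
    col_tot = {c: sum(table[r].get(c, 0) for r in rows) for c in cols}
    n = sum(row_tot.values())

    stat = 0.0
    for r in rows:
        for c in cols:
            # Expected count if topic and veracity were independent.
            expected = row_tot[r] * col_tot[c] / n
            observed = table[r].get(c, 0)
            stat += (observed - expected) ** 2 / expected

    dof = (len(rows) - 1) * (len(cols) - 1)
    return stat, dof


# Hypothetical toy counts for illustration only (NOT data from the corpus).
toy_counts = {
    "politics": {"fake": 30, "true": 10},
    "health": {"fake": 10, "true": 30},
}

stat, dof = chi_square_independence(toy_counts)
print(f"chi2 = {stat:.2f}, dof = {dof}")
```

A large statistic relative to the chi-square distribution with `dof` degrees of freedom suggests that topic and veracity are not independent. Turning the statistic into a p-value requires that distribution's survival function, which a statistics library would normally provide.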

Problem

Research questions and friction points this paper is trying to address.

fake news

multilingual corpus

North Africa

misinformation

language varieties

Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual corpus

fake news detection

low-resource languages