Multi-task Learning with Active Learning for Arabic Offensive Speech Detection

📅 2025-06-03
🤖 AI Summary
Detecting offensive language in Arabic social media is hampered by scarce labeled data, high dialectal diversity, and linguistic complexity. To address these challenges, the paper proposes a framework that integrates multi-task learning (MTL) with active learning (AL). The method uses a dynamically weighted MTL mechanism to jointly model violent, vulgar, and offensive speech; adopts an uncertainty-based AL strategy to maximize labeling efficiency; and introduces weighted emoji handling to better capture non-textual cues. Evaluated on the OSACT2022 benchmark, the approach achieves a state-of-the-art macro-F1 score of 85.42% using only ~40% of the fully supervised labeling budget, a 60% reduction in annotated samples. Key contributions include: (i) the first dynamic task-weighting scheme for Arabic offensive language detection, (ii) an uncertainty-driven cross-task AL paradigm, and (iii) a dialect-aware, emoji-semantic weighting strategy tailored to Arabic social text.

📝 Abstract
The rapid growth of social media has amplified the spread of offensive, violent, and vulgar speech, which poses serious societal and cybersecurity concerns. Detecting such content in Arabic text is particularly complex due to limited labeled data, dialectal variations, and the language's inherent complexity. This paper proposes a novel framework that integrates multi-task learning (MTL) with active learning to enhance offensive speech detection in Arabic social media text. By jointly training on two auxiliary tasks, violent and vulgar speech detection, the model leverages shared representations to improve the detection accuracy for offensive speech. Our approach dynamically adjusts task weights during training to balance the contribution of each task and optimize performance. To address the scarcity of labeled data, we employ an active learning strategy that uses several uncertainty sampling techniques to iteratively select the most informative samples for model training. We also introduce weighted emoji handling to better capture semantic cues. Experimental results on the OSACT2022 dataset show that the proposed framework achieves a state-of-the-art macro F1-score of 85.42%, outperforming existing methods while using significantly fewer fine-tuning samples. The findings of this study highlight the potential of integrating MTL with active learning for efficient and accurate offensive language detection in resource-constrained settings.
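The abstract mentions "several uncertainty sampling techniques" without naming them. As a rough illustration only, the sketch below shows three commonly used scores (least confidence, margin, and entropy) and a selection step that picks the most uncertain unlabeled items. The function names, the toy probability pool, and the choice of these three scores are assumptions made for illustration, not details taken from the paper.

```python
import math

# Hypothetical helpers: each takes one predicted class distribution
# (a list of probabilities) and returns a score where higher = more uncertain.

def least_confidence(probs):
    """One minus the probability of the most likely class."""
    return 1.0 - max(probs)

def margin_score(probs):
    """Negated gap between the two most likely classes."""
    top = sorted(probs, reverse=True)
    return -(top[0] - top[1])

def entropy(probs):
    """Shannon entropy of the predicted distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(pool_probs, k, score_fn=entropy):
    """Return indices of the k most uncertain items in the unlabeled pool."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: score_fn(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy pool: model probabilities for (not offensive, offensive) on four posts.
pool = [[0.98, 0.02], [0.55, 0.45], [0.70, 0.30], [0.51, 0.49]]
picked = select_for_labeling(pool, k=2)
print(picked)
```

In an active learning loop, the selected items would be sent for annotation, added to the training set, and the model fine-tuned again before the next selection round.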
Problem

Research questions and friction points this paper is trying to address.

Detects offensive Arabic speech in social media with limited labeled data
Addresses dialectal variations and inherent Arabic language complexity
Improves detection accuracy using multi-task and active learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-task learning for shared representation enhancement
Active learning with uncertainty sampling techniques
Weighted emoji handling for semantic cue capture
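The summary does not say how task weights are adjusted during training. Purely as a sketch of the general idea, the example below weights each task in proportion to its current loss, so that tasks the model is handling worse contribute more to the combined objective. This loss-proportional rule, the function names, and the example loss values are all assumptions chosen for illustration and should not be read as the authors' scheme.

```python
def dynamic_task_weights(losses):
    """Assumed rule: weight each task by its share of the total loss.

    Weights are scaled to sum to the number of tasks, so equal losses
    reduce to a weight of 1.0 per task.
    """
    total = sum(losses.values())
    n_tasks = len(losses)
    if total == 0:
        return {task: 1.0 for task in losses}
    return {task: n_tasks * loss / total for task, loss in losses.items()}

def combined_loss(losses, weights):
    """Weighted sum of per-task losses used as the joint training objective."""
    return sum(weights[task] * losses[task] for task in losses)

# Hypothetical per-task losses from one training step: the main task
# (offensive) and the two auxiliary tasks (violent, vulgar).
step_losses = {"offensive": 0.6, "violent": 0.3, "vulgar": 0.1}
weights = dynamic_task_weights(step_losses)
total_loss = combined_loss(step_losses, weights)
print(weights, total_loss)
```

Recomputing the weights at every step or epoch lets the balance between the main and auxiliary tasks shift as training progresses, which is the behavior the summary describes at a high level.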
Aisha Alansari
Graduate Assistant, Information and Computer Science Department, KFUPM
Machine Learning, Natural Language Processing, Deep Learning, LLMs
H. Luqman
SDAIA-KFUPM Joint Research Center for Artificial Intelligence, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia