Multi-task Learning with Active Learning for Arabic Offensive Speech Detection

📅 2025-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Detecting offensive language in Arabic social media is hampered by scarce labeled data, high dialectal diversity, and linguistic complexity. To address these challenges, the authors propose a framework that integrates multi-task learning (MTL) and active learning (AL). The method employs a dynamically weighted MTL mechanism to jointly model violence, profanity, and offensiveness; adopts an uncertainty-based AL strategy to maximize labeling efficiency; and introduces a weighted emoji semantic embedding to better capture non-textual cues. Evaluated on the OSACT2022 benchmark, the approach achieves a state-of-the-art macro-F1 score of 85.42% using only ~40% of the fully supervised labeling budget, a 60% reduction in annotated samples. Key contributions include: (i) the first dynamic task-weighting scheme for Arabic offensive language detection, (ii) an uncertainty-driven cross-task AL paradigm, and (iii) a dialect-aware, emoji-semantic weighting strategy tailored to Arabic social text.

📝 Abstract

The rapid growth of social media has amplified the spread of offensive, violent, and vulgar speech, which poses serious societal and cybersecurity concerns. Detecting such content in Arabic text is particularly complex due to limited labeled data, dialectal variations, and the language's inherent complexity. This paper proposes a novel framework that integrates multi-task learning (MTL) with active learning to enhance offensive speech detection in Arabic social media text. By jointly training on two auxiliary tasks, violent speech detection and vulgar speech detection, the model leverages shared representations to improve the accuracy of offensive speech detection. Our approach dynamically adjusts task weights during training to balance the contribution of each task and optimize performance. To address the scarcity of labeled data, we employ an active learning strategy that uses several uncertainty sampling techniques to iteratively select the most informative samples for model training. We also introduce weighted emoji handling to better capture semantic cues. Experimental results on the OSACT2022 dataset show that the proposed framework achieves a state-of-the-art macro F1-score of 85.42%, outperforming existing methods while using significantly fewer fine-tuning samples. The findings of this study highlight the potential of integrating MTL with active learning for efficient and accurate offensive language detection in resource-constrained settings.
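
The abstract mentions "several uncertainty sampling techniques" without naming them. As a rough illustration only, the sketch below implements three commonly used scores (least confidence, margin, and entropy) and a selection step that picks the most uncertain unlabeled samples. The function names, the toy probability pool, and the choice of these three scores are assumptions made for this example, not details taken from the paper.

```python
import math

# Each score returns a larger value for a more uncertain prediction.
# These three scores are common choices; the paper may use others.

def least_confidence(probs):
    # 1 minus the probability of the most likely class.
    return 1.0 - max(probs)

def margin_score(probs):
    # 1 minus the gap between the two most likely classes.
    top = sorted(probs, reverse=True)
    return 1.0 - (top[0] - top[1])

def entropy(probs):
    # Shannon entropy of the predicted distribution (natural log).
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(pool_probs, k, score=entropy):
    # Rank unlabeled samples by uncertainty and return the indices of the top k.
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: score(pool_probs[i]),
                    reverse=True)
    return ranked[:k]

# Hypothetical model outputs for four unlabeled posts: [P(not offensive), P(offensive)].
pool = [[0.98, 0.02], [0.55, 0.45], [0.70, 0.30], [0.51, 0.49]]
picked = select_for_labeling(pool, k=2)
```

In an active learning loop, the selected indices would be sent to annotators, added to the training set, and the model would be retrained before the next round of selection.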

Problem

Research questions and friction points this paper is trying to address.

Offensive speech in Arabic social media must be detected with limited labeled data

Dialectal variation and the inherent complexity of Arabic make detection harder

Detection accuracy needs to improve without a large annotation budget, motivating multi-task and active learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-task learning for shared representation enhancement

Active learning with uncertainty sampling techniques

Weighted emoji handling for semantic cue capture
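
The listing says task weights are adjusted dynamically during training but does not describe the rule. One well-known scheme of this kind is Dynamic Weight Average (DWA), which gives more weight to tasks whose loss is decreasing more slowly. The sketch below uses DWA purely as a stand-in; it is not claimed to be the paper's weighting method, and the function names, temperature value, and loss numbers are hypothetical.

```python
import math

def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    # Ratio near 1 means the task's loss barely dropped in the last epoch;
    # a small ratio means the task is still improving quickly.
    ratios = [a / b for a, b in zip(prev_losses, prev_prev_losses)]
    # Softmax over the ratios, scaled so the weights sum to the number of tasks.
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    num_tasks = len(ratios)
    return [num_tasks * e / total for e in exps]

def combined_loss(task_losses, weights):
    # Weighted sum of per-task losses used as the joint training objective.
    return sum(w * l for w, l in zip(weights, task_losses))

# Hypothetical average losses for (offensive, violent, vulgar) in two past epochs.
losses_two_epochs_ago = [0.80, 0.60, 0.50]
losses_last_epoch = [0.40, 0.57, 0.45]

weights = dwa_weights(losses_last_epoch, losses_two_epochs_ago)
joint = combined_loss([0.38, 0.55, 0.44], weights)
```

With these toy numbers the violent-speech task, whose loss fell the least, receives the largest weight, and the fast-improving offensive-speech task receives the smallest.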
