🤖 AI Summary
To address the poor performance of offensive language detection for the low-resource Sinhala language on social media, this work introduces the Subasa model family, built around an intermediate pre-finetuning step. Methodologically, the authors apply Masked Rationale Prediction (MRP), an intermediate task that trains the model on token-level rationales for offensive cues, to Sinhala for the first time. They adapt XLM-R, Llama 3.2, and Mistral v0.3, yielding Subasa-XLM-R, Subasa-Llama, and Subasa-Mistral, together with a task-specific fine-tuning pipeline evaluated on the SOLD benchmark. Experimental results show that Subasa-XLM-R achieves a Macro F1 score of 0.84, outperforming all existing baselines as well as GPT-4o evaluated zero-shot on the same benchmark. The models and code are publicly released.
📝 Abstract
Accurate detection of offensive language is essential for a number of applications related to social media safety. There is a sharp contrast in performance on this task between low- and high-resource languages. In this paper, we adapt fine-tuning strategies that have not previously been explored for Sinhala in the downstream task of offensive language detection. Using this approach, we introduce four models: "Subasa-XLM-R", which incorporates an intermediate pre-finetuning step using Masked Rationale Prediction; and two variants of "Subasa-Llama" and "Subasa-Mistral", fine-tuned versions of Llama (3.2) and Mistral (v0.3), respectively, trained with a task-specific strategy. We evaluate our models on the SOLD benchmark dataset for Sinhala offensive language detection. All our models outperform existing baselines. Subasa-XLM-R achieves the highest Macro F1 score (0.84), surpassing state-of-the-art large language models such as GPT-4o when evaluated on the same SOLD benchmark dataset under zero-shot settings. The models and code are publicly available.
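To make the Masked Rationale Prediction idea concrete, below is a minimal, illustrative sketch of how MRP training targets could be constructed from token-level rationale annotations: a fraction of the binary rationale flags is replaced by a mask symbol, and the model is then trained to recover the hidden flags. All names (`build_mrp_example`, the `[MASK]` symbol, the masking ratio) are assumptions for illustration, not details taken from the paper.

```python
import random

MASK = "[MASK]"  # placeholder symbol standing in for a masked rationale flag


def build_mrp_example(tokens, rationales, mask_ratio=0.5, seed=0):
    """Build one illustrative Masked Rationale Prediction example.

    tokens     : list of tokens in the sentence
    rationales : parallel list of 0/1 flags (1 = token is part of the
                 human rationale for the offensive label)
    Returns (inputs, targets): `inputs` is the rationale sequence with a
    random subset of positions replaced by MASK; `targets` holds the
    hidden flag at each masked position and None elsewhere.
    """
    assert len(tokens) == len(rationales)
    rng = random.Random(seed)  # fixed seed for a reproducible sketch
    n_mask = max(1, int(len(rationales) * mask_ratio))
    masked = set(rng.sample(range(len(rationales)), n_mask))
    inputs = [MASK if i in masked else r for i, r in enumerate(rationales)]
    targets = [r if i in masked else None for i, r in enumerate(rationales)]
    return inputs, targets
```

In an actual pre-finetuning setup, the masked rationale sequence would be paired with the token sequence and fed to the encoder, whose objective is to predict the hidden flags; this sketch only shows the data-side construction.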