🤖 AI Summary
This study addresses fine-grained sentiment and emotion classification of Swahili–English code-switched tweets from Kenya, a setting that is low-resource, noisy, and linguistically heterogeneous (e.g., slang, orthographic variation). We introduce RHS, the first high-quality, manually annotated, open-source dataset for this task, and propose a unified evaluation framework that integrates supervised learning with unsupervised domain adaptation (UDA)-style semi-supervised learning to systematically benchmark four multilingual models: XLM-R, DistilBERT, mBERT, and AfriBERTa. Results show that supervised XLM-R achieves 69.2% accuracy and 66.1% F1 on sentiment classification, while DistilBERT attains the best emotion classification performance (59.8% accuracy, 31.0% F1); both significantly outperform Africa-focused models such as AfriBERTa. The study highlights XLM-R's superior cross-lingual generalization and exposes a critical performance bottleneck of semi-supervised methods under extremely limited labeled data. This work establishes a benchmark dataset and methodology for low-resource code-switched text analysis.
📝 Abstract
Social media has become a crucial open-access platform that enables individuals to freely express opinions and share experiences. These platforms host user-generated content that facilitates instantaneous communication and feedback. However, leveraging low-resource language data from Twitter is challenging due to the scarcity and poor quality of content, with significant variation in language use such as slang and code-switching. Automatically identifying tweets in low-resource languages is also difficult because Twitter primarily supports high-resource languages, and low-resource languages often lack robust linguistic and contextual support. This paper analyzes Kenyan code-switched data from Twitter using four transformer-based pretrained models for sentiment and emotion classification, under both supervised and semi-supervised methods. We detail the data collection methodology, the annotation procedure, and the challenges encountered during data curation. Our results show that XLM-R outperforms the other models: for sentiment analysis, the supervised XLM-R model achieves the highest accuracy (69.2%) and F1 score (66.1%), followed by the semi-supervised XLM-R model (67.2% accuracy, 64.1% F1). In emotion analysis, supervised DistilBERT leads in accuracy (59.8%) and F1 score (31.0%), followed by semi-supervised mBERT (59.0% accuracy, 26.5% F1). The AfriBERTa models record the lowest accuracy and F1 scores on both tasks. This indicates that the semi-supervised method's performance is constrained by the small labeled dataset.
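The large gap between accuracy and F1 in the emotion results (e.g., 59.8% accuracy vs. 31.0% F1 for DistilBERT) is characteristic of macro-averaged F1 on an imbalanced label set: a model that favors frequent emotion classes keeps accuracy high while rare, missed classes pull the macro average down. A minimal sketch of the two metrics, assuming macro averaging (the paper does not state the averaging scheme) and using purely illustrative labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    labels = set(y_true) | set(y_pred)
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Illustrative imbalanced sample: the majority class ("joy") dominates
# accuracy, but the missed minority classes drag macro-F1 down.
y_true = ["joy", "joy", "joy", "joy", "anger", "fear"]
y_pred = ["joy", "joy", "joy", "joy", "joy",   "joy"]
print(accuracy(y_true, y_pred))  # 0.666... (4/6 correct)
print(macro_f1(y_true, y_pred))  # 0.266... (only "joy" scores F1 = 0.8)
```

The toy predictions get two thirds of the tweets right yet reach a macro-F1 of only about 0.27, mirroring how a classifier biased toward common emotions can post a respectable accuracy alongside a low F1.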