GeeSanBhava: Sentiment Tagged Sinhala Music Video Comment Data Set

📅 2025-11-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the scarcity of annotated resources for Sinhala sentiment analysis by introducing GeeSanBhava, a manually annotated data set of Sinhala YouTube music-video comments. Three independent annotators tagged the comments using Russell's valence-arousal model and reached a Fleiss' κ of 0.8496. The analysis compares comment-based emotion with song-based emotion and addresses the biases inherent in user-generated content. For classification, machine learning and deep learning models were pre-trained on a related large data set of Sinhala news comments and evaluated zero-shot on the YouTube comments; an optimized three-layer MLP (256, 128, and 64 neurons), after hyperparameter tuning, achieved ROC-AUC = 0.887. Key contributions include: (1) an annotated data set for sentiment analysis of Sinhala music comments; (2) evidence that different songs have distinct comment-based emotional profiles; and (3) a zero-shot baseline for future work in Sinhala NLP and music emotion recognition.

📝 Abstract

This study introduces GeeSanBhava, a high-quality data set of Sinhala song comments extracted from YouTube and manually tagged using Russell's Valence-Arousal model by three independent human annotators. The annotators achieved substantial inter-annotator agreement (Fleiss' kappa = 0.8496). The analysis revealed distinct emotional profiles for different songs, highlighting the importance of comment-based emotion mapping. The study also addressed the challenges of comparing comment-based and song-based emotions, mitigating biases inherent in user-generated content. A number of machine learning and deep learning models were pre-trained on a related large data set of Sinhala news comments in order to report zero-shot results on our Sinhala YouTube comment data set. An optimized Multi-Layer Perceptron model, after extensive hyperparameter tuning, achieved a ROC-AUC score of 0.887. The model is a three-layer MLP with a configuration of 256, 128, and 64 neurons. This research contributes a valuable annotated data set and provides insights for future work in Sinhala Natural Language Processing and music emotion recognition.
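The agreement figure in the abstract is a Fleiss' kappa, which compares how often annotators agree with how often they would agree by chance. A minimal sketch of the standard formula follows. The function name, the input layout (one list of labels per comment), and the toy labels are assumptions made for illustration; the paper's actual annotation files and label set are not described here.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of annotator labels.

    Every item must be labelled by the same number of annotators
    (three in the abstract). Label names are arbitrary hashable values.
    """
    if not ratings:
        raise ValueError("need at least one rated item")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of annotator pairs that agree on this item.
        squared = sum(c * c for c in counts.values())
        agreement_sum += (squared - n_raters) / (n_raters * (n_raters - 1))

    observed = agreement_sum / len(ratings)
    total = len(ratings) * n_raters
    # Chance agreement from the overall label proportions.
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        # Only one label was ever used; kappa is undefined, and returning
        # 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels, not taken from the data set.
toy = [
    ["happy", "happy", "happy"],
    ["sad", "sad", "sad"],
    ["happy", "happy", "sad"],
    ["happy", "sad", "sad"],
]
kappa = fleiss_kappa(toy)  # 1/3 for this toy input
```

Two fully agreed items and two split items give a kappa of 1/3 here, far below the 0.8496 reported for GeeSanBhava.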

Problem

Research questions and friction points this paper is trying to address.

Creating a sentiment-tagged dataset for Sinhala YouTube music comments

Addressing challenges in comparing comment-based and song-based emotions

Developing machine learning models for Sinhala music emotion recognition
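The abstract reports model quality as a ROC-AUC score. As a rough illustration of what that number measures, the sketch below computes binary ROC-AUC as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, and then averages it one-vs-rest across classes. The function names, the one-vs-rest macro averaging, and the toy scores are assumptions; the source does not say which multi-class averaging scheme the authors used.

```python
def roc_auc(labels, scores):
    """Binary ROC-AUC by pairwise comparison; labels are 0 or 1."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


def macro_ovr_auc(labels, prob_rows):
    """Unweighted mean of one-vs-rest AUCs (an assumed averaging scheme).

    prob_rows holds one dict per example mapping class name -> score.
    """
    classes = sorted(set(labels))
    aucs = []
    for cls in classes:
        binary = [1 if y == cls else 0 for y in labels]
        scores = [row[cls] for row in prob_rows]
        aucs.append(roc_auc(binary, scores))
    return sum(aucs) / len(aucs)


# Toy scores for illustration only.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

The pairwise loop is quadratic in the number of examples, which is fine for a demonstration; a rank-based implementation would be preferable on a full data set.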

Innovation

Methods, ideas, or system contributions that make the work stand out.

Manually annotated Sinhala YouTube comment dataset

Pre-trained models on related Sinhala news comments

Optimized three-layer MLP with hyperparameter tuning
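The only architectural detail the abstract gives for the MLP is its hidden-layer widths: 256, 128, and 64 neurons. The sketch below shows what a network of that shape looks like as a plain forward pass. Everything else is an assumption made for illustration: the input size, the number of output classes, ReLU activations, the softmax output, and the random initialization. It is not the authors' trained model.

```python
import math
import random

HIDDEN_SIZES = (256, 128, 64)  # layer widths stated in the abstract


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (initialization scheme is assumed)."""
    scale = math.sqrt(2.0 / n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def build_mlp(n_features, n_classes, seed=0):
    """Stack input -> 256 -> 128 -> 64 -> output layers."""
    rng = random.Random(seed)
    sizes = (n_features, *HIDDEN_SIZES, n_classes)
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def forward(layers, x):
    """Return class probabilities for one feature vector."""
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # ReLU on hidden layers (assumed)
    peak = max(x)
    exps = [math.exp(v - peak) for v in x]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]


def count_parameters(layers):
    """Total number of weights and biases in the network."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


# Hypothetical sizes: 100 input features, 4 output classes.
mlp = build_mlp(n_features=100, n_classes=4)
probabilities = forward(mlp, [0.1] * 100)
n_params = count_parameters(mlp)  # 67268 for these assumed sizes
```

With these assumed input and output sizes the network has 67,268 trainable parameters, most of them in the first two layers.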

🔎 Similar Papers

Are we there yet? A brief survey of Music Emotion Prediction Datasets, Models and Outstanding Challenges