MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of detecting implicit meanings—such as sarcasm, humor, offensiveness, and vulgarity—in Bengali-English code-mixed social media content, a task hindered by scarce annotated data and complexities like code-switching and culture-specific references. The authors present MixSarc, the first publicly available multilabel corpus comprising 9,087 systematically collected and rigorously annotated sentences. Benchmark experiments using fine-tuned Transformers and zero-shot large language models show relatively strong performance in humor detection, while sarcasm recognition remains limited by data imbalance and pragmatic complexity. Zero-shot models achieve competitive micro-F1 scores but exhibit low exact-match accuracy. Notably, over 42% of negative sentiment instances in external datasets display sarcastic traits. This work provides the first culturally grounded resource for implicit meaning analysis in South Asian code-mixed contexts, advancing research in multilingual pragmatic understanding.

📝 Abstract
Bangla-English code-mixing is widespread across South Asian social media, yet resources for implicit meaning identification in this setting remain scarce. Existing sentiment and sarcasm models largely focus on monolingual English or high-resource languages and struggle with transliteration variation, cultural references, and intra-sentential language switching. To address this gap, we introduce MixSarc, the first publicly available Bangla-English code-mixed corpus for implicit meaning identification. The dataset contains 9,087 manually annotated sentences labeled for humor, sarcasm, offensiveness, and vulgarity. We construct the corpus through targeted social media collection, systematic filtering, and multi-annotator validation. We benchmark transformer-based models and evaluate zero-shot large language models under structured prompting. Results show strong performance on humor detection but substantial degradation on sarcasm, offense, and vulgarity due to class imbalance and pragmatic complexity. Zero-shot models achieve competitive micro-F1 scores but low exact match accuracy. Further analysis reveals that over 42% of negative sentiment instances in an external dataset exhibit sarcastic characteristics. MixSarc provides a foundational resource for culturally aware NLP and supports more reliable multi-label modeling in code-mixed environments.
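The abstract's contrast between competitive micro-F1 and low exact-match accuracy follows directly from how the two metrics treat multilabel predictions: micro-F1 pools per-label hits across all sentences, while exact match requires the entire four-label vector to be correct. A minimal sketch with hypothetical predictions (the paper's actual scores and data are not reproduced here) illustrates the gap:

```python
# Hedged sketch: micro-F1 vs. exact-match accuracy for multilabel output.
# Label order follows the paper's four classes; predictions are hypothetical.
LABELS = ["humor", "sarcasm", "offensive", "vulgar"]

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over every (sentence, label) pair."""
    pairs = [(t, p) for gold, pred in zip(y_true, y_pred)
             for t, p in zip(gold, pred)]
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def exact_match(y_true, y_pred):
    """Fraction of sentences whose full label vector is predicted exactly."""
    return sum(g == p for g, p in zip(y_true, y_pred)) / len(y_true)

# Three hypothetical sentences: partial per-label hits keep micro-F1 high
# (0.8 here) while only one of three vectors matches exactly (~0.33).
gold = [(1, 1, 0, 0), (0, 1, 1, 0), (1, 0, 0, 0)]
pred = [(1, 0, 0, 0), (0, 1, 1, 0), (1, 0, 0, 1)]

print(micro_f1(gold, pred))    # → 0.8
print(exact_match(gold, pred)) # → 0.333...
```

This is why a zero-shot model can look strong under micro-F1 yet still rarely recover a sentence's complete label set.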
Problem

Research questions and friction points this paper is trying to address.

code-mixing
implicit meaning identification
sarcasm detection
Bangla-English
social media NLP
Innovation

Methods, ideas, or system contributions that make the work stand out.

code-mixed NLP
implicit meaning identification
sarcasm detection
Bangla-English corpus
zero-shot prompting
Kazi Samin Yasar Alam
Department of Computer Science and Engineering, Islamic University of Technology, Dhaka, Bangladesh
Md Tanbir Chowdhury
Department of Computer Science and Engineering, Islamic University of Technology, Dhaka, Bangladesh
Tamim Ahmed
Department of Computer Science and Engineering, Islamic University of Technology, Dhaka, Bangladesh
Ajwad Abrar
Junior Lecturer, IUT
Natural Language Processing, Human Computer Interaction, Software Engineering
Md Rafid Haque
Department of Computer Science, University of Illinois at Chicago, Chicago, United States