🤖 AI Summary
Large language models (LLMs) are known to be biased against non-mainstream varieties of English, yet labeled datasets for sentiment analysis and sarcasm detection in varieties such as Australian, Indian, and British English are scarce, undermining linguistic fairness and robustness.
Method: We introduce BESSTIE, a native-speaker-annotated benchmark for sentiment and sarcasm classification across three English varieties, built from web-based content (Google Place reviews and Reddit comments) collected via location-based and topic-based filtering. We validate language variety through both manual annotation and automatic variety prediction, and systematically evaluate nine monolingual and multilingual LLMs on cross-variety generalisation.
Contribution/Results: Our evaluation reveals substantial performance drops for Indian English (en-IN), particularly in sarcasm detection, along with consistently poor cross-variety transfer across all models. To support equitable LLM evaluation and robustness research, the BESSTIE datasets, annotation guidelines, training code, and fine-tuned models will be publicly released.
📝 Abstract
Despite large language models (LLMs) being known to exhibit bias against non-mainstream varieties, there are no known labeled datasets for sentiment analysis across varieties of English. To address this gap, we introduce BESSTIE, a benchmark for sentiment and sarcasm classification for three varieties of English: Australian (en-AU), Indian (en-IN), and British (en-UK). Using web-based content from two domains, namely, Google Place reviews and Reddit comments, we collect datasets for these language varieties using two methods: location-based and topic-based filtering. Native speakers of the language varieties manually annotate the datasets with sentiment and sarcasm labels. To assess whether the dataset accurately represents these varieties, we conduct two validation steps: (a) manual annotation of language varieties and (b) automatic language variety prediction. Subsequently, we fine-tune nine LLMs (representing a range of encoder/decoder and mono/multilingual models) on these datasets, and evaluate their performance on the two tasks. Our results reveal that the models consistently perform better on inner-circle varieties (i.e., en-AU and en-UK), with significant performance drops for en-IN, particularly in sarcasm detection. We also report challenges in cross-variety generalisation, highlighting the need for language variety-specific datasets such as ours. BESSTIE promises to be a useful evaluative benchmark for future research in equitable LLMs, specifically in terms of language varieties. The BESSTIE datasets, code, and models will be publicly available upon acceptance.