Far Out: Evaluating Language Models on Slang in Australian and Indian English

📅 2026-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the significant performance gap in current language models when interpreting region-specific slang from non-standard English varieties, such as Australian and Indian English, an area previously lacking systematic evaluation. The authors present the first dual-source dataset combining authentic web-collected corpora and synthetically generated data, and introduce three evaluation tasks: Target Word Prediction (TWP), Guided Prediction (TWP*), and Target Word Selection (TWS). Comprehensive experiments across seven state-of-the-art models reveal a pronounced asymmetry between generative and discriminative capabilities: TWS accuracy (0.49) substantially exceeds TWP performance (0.03). Furthermore, models perform better on web-derived data than synthetic data and demonstrate consistently stronger results on Indian English slang, achieving a TWS accuracy of 0.54 compared to 0.44 for Australian English.

📝 Abstract
Language models exhibit systematic performance gaps when processing text in non-standard language varieties, yet their ability to comprehend variety-specific slang remains underexplored for several languages. We present a comprehensive evaluation of slang awareness in Indian English (en-IN) and Australian English (en-AU) across seven state-of-the-art language models. We construct two complementary datasets: WEB, containing 377 web-sourced usage examples from Urban Dictionary, and GEN, featuring 1,492 synthetically generated usages of these slang terms across diverse scenarios. We assess language models on three tasks: target word prediction (TWP), guided target word prediction (TWP*), and target word selection (TWS). Our results reveal three key findings: (1) models perform better on average on TWS than on TWP and TWP*, with average accuracy rising from 0.03 to 0.49; (2) models perform better on average on the WEB dataset than on GEN, with average similarity scores increasing by 0.03 and 0.05 on the TWP and TWP* tasks respectively; (3) en-IN tasks outperform en-AU when averaged across all models and datasets, with TWS showing the largest disparity, as average accuracy rises from 0.44 to 0.54. These findings underscore fundamental asymmetries between generative and discriminative competencies for variety-specific language, particularly for slang expressions, even in a technologically rich language such as English.
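The TWS task described above is a discriminative, multiple-choice setup: the model must pick the correct slang term for a usage context from a candidate set, and accuracy is the fraction of examples answered correctly. The sketch below illustrates that scoring logic only; the function names, data format, and toy examples are hypothetical and not taken from the paper.

```python
# Illustrative sketch of Target Word Selection (TWS) scoring.
# All names and example data are assumptions, not the paper's code.

def tws_accuracy(examples, select):
    """Compute selection accuracy.

    examples: list of (sentence_with_blank, candidates, gold_term)
    select:   callable that returns one candidate given (sentence, candidates)
    """
    if not examples:
        return 0.0
    correct = sum(
        1 for sentence, candidates, gold in examples
        if select(sentence, candidates) == gold
    )
    return correct / len(examples)

# Toy stand-in "model": always picks the first candidate.
def first_choice(sentence, candidates):
    return candidates[0]

examples = [
    ("That party was an absolute ___.", ["ripper", "ledger"], "ripper"),
    ("She gave me all the ___ from work.", ["goss", "vibes"], "goss"),
]
print(tws_accuracy(examples, first_choice))  # → 1.0
```

In a real evaluation, `select` would wrap a language-model call (e.g. scoring each candidate filled into the blank and returning the highest-likelihood one); the accuracy computation itself stays the same.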
Problem

Research questions and friction points this paper is trying to address.

language models
slang
Australian English
Indian English
non-standard language varieties
Innovation

Methods, ideas, or system contributions that make the work stand out.

slang evaluation
language variety
non-standard English
dataset construction
language model robustness