AAVENUE: Detecting LLM Biases on NLU Tasks in AAVE via a Novel Benchmark

📅 2024-08-27
🏛️ NLP4PI
📈 Citations: 10
Influential: 2
🤖 AI Summary
Large language models (LLMs) exhibit systematic biases in natural language understanding for African American Vernacular English (AAVE), yet no dedicated benchmark exists to rigorously assess dialect fairness. Method: We introduce AAVENUE, the first benchmark explicitly designed to evaluate AAVE fairness. It replaces deterministic rule-based translation with a few-shot prompting paradigm for LLM-driven SAE-to-AAVE translation, validated by fluent AAVE speakers and multidimensional human evaluation (fluency, coherence, understandability), alongside automated metrics (BARTScore), and applied to key tasks from the GLUE and SuperGLUE benchmarks. Contribution/Results: Experiments show that five popular LLMs perform consistently worse on AAVE-translated tasks than on their Standard American English (SAE) counterparts, confirming pervasive dialect bias. Compared to the prior VALUE benchmark, AAVENUE improves translation authenticity, task representativeness, and evaluation reliability, providing a reproducible, methodologically grounded framework to advance inclusive NLP.

📝 Abstract
Detecting biases in natural language understanding (NLU) for African American Vernacular English (AAVE) is crucial to developing inclusive natural language processing (NLP) systems. To address dialect-induced performance discrepancies, we introduce AAVENUE (AAVE Natural Language Understanding Evaluation), a benchmark for evaluating large language model (LLM) performance on NLU tasks in AAVE and Standard American English (SAE). AAVENUE builds upon and extends existing benchmarks like VALUE, replacing deterministic syntactic and morphological transformations with a more flexible methodology leveraging LLM-based translation with few-shot prompting, improving performance across our evaluation metrics when translating key tasks from the GLUE and SuperGLUE benchmarks. We compare AAVENUE and VALUE translations using five popular LLMs and a comprehensive set of metrics including fluency, BARTScore, quality, coherence, and understandability. Additionally, we recruit fluent AAVE speakers to validate our translations for authenticity. Our evaluations reveal that LLMs consistently perform better on SAE tasks than AAVE-translated versions, underscoring inherent biases and highlighting the need for more inclusive NLP models.
Problem

Research questions and friction points this paper is trying to address.

Detecting LLM biases in African American Vernacular English NLU tasks
Evaluating performance discrepancies between AAVE and Standard American English
Developing inclusive NLP systems through novel benchmark methodology
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based translation with few-shot prompting
Replacing deterministic transformations with flexible methodology
Validating translations via fluent AAVE speakers
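The few-shot prompting approach listed above can be sketched as a simple prompt-assembly routine. Note that the example SAE/AAVE sentence pairs, the instruction wording, and the function name below are illustrative assumptions for this summary, not the paper's actual prompts:

```python
# Hypothetical sketch of few-shot prompt construction for SAE-to-AAVE
# translation. The demonstration pairs and instruction text are invented
# for illustration and do not come from the AAVENUE paper itself.

FEW_SHOT_EXAMPLES = [
    ("He is going to the store.", "He finna go to the store."),
    ("She has been working all day.", "She been workin' all day."),
]

def build_translation_prompt(sae_sentence: str) -> str:
    """Assemble a few-shot prompt asking an LLM to render an SAE sentence in AAVE."""
    lines = [
        "Translate the following Standard American English (SAE) sentences "
        "into African American Vernacular English (AAVE)."
    ]
    # Each demonstration pair shows the model the desired input/output format.
    for sae, aave in FEW_SHOT_EXAMPLES:
        lines.append(f"SAE: {sae}")
        lines.append(f"AAVE: {aave}")
    # The new sentence is appended with an open "AAVE:" slot for the model
    # to complete.
    lines.append(f"SAE: {sae_sentence}")
    lines.append("AAVE:")
    return "\n".join(lines)

prompt = build_translation_prompt("They are not coming tonight.")
print(prompt)
```

In the benchmark's pipeline, a prompt like this would be sent to each of the five LLMs under evaluation, and the returned AAVE translations would then be scored by automated metrics (e.g., BARTScore) and validated by fluent AAVE speakers.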