Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English

๐Ÿ“… 2025-03-06
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This study identifies a systematic degradation in the reasoning capabilities of large language models (LLMs) when processing African American English (AAE), particularly in social science and humanities tasksโ€”evidenced by a 12.7% average accuracy drop, a 34% reduction in reasoning chain length, and diminished explanation completeness and quality. Method: We introduce the first standardized evaluation framework integrating LLM-based dialect conversion with linguistic analysis, combining contrastive prompting, structured reasoning chain assessment, and domain-stratified evaluation protocols. Contribution/Results: Our empirical analysis reveals, for the first time, a dialect-dependent attenuation effect: reasoning chain complexity and explanatory quality degrade significantly under AAE input. This highlights a critical fairness gap in current LLMs across multilingual and multidialectal contexts. The work establishes a reproducible methodological foundation and provides key empirical evidence to guide the development of linguistically inclusive reasoning models.

Technology Category

Application Category

๐Ÿ“ Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning tasks, leading to their widespread deployment. However, recent studies have highlighted concerning biases in these models, particularly in their handling of dialectal variations like African American English (AAE). In this work, we systematically investigate dialectal disparities in LLM reasoning tasks. We develop an experimental framework comparing LLM performance given Standard American English (SAE) and AAE prompts, combining LLM-based dialect conversion with established linguistic analyses. We find that LLMs consistently produce less accurate responses and simpler reasoning chains and explanations for AAE inputs compared to equivalent SAE questions, with disparities most pronounced in social science and humanities domains. These findings highlight systematic differences in how LLMs process and reason about different language varieties, raising important questions about the development and deployment of these systems in our multilingual and multidialectal world. Our code repository is publicly available at https://github.com/Runtaozhou/dialect_bias_eval.
Problem

Research questions and friction points this paper is trying to address.

Investigates dialectal disparities in LLM reasoning tasks.
Compares LLM performance between Standard and African American English.
Highlights accuracy and reasoning differences in social science domains.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed experimental framework for dialect comparison
Combined LLM-based dialect conversion with linguistic analysis
Publicly available code repository for dialect bias evaluation