A Framework for Auditing Chatbots for Dialect-Based Quality-of-Service Harms

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study examines whether dialect bias in large language models (LLMs) propagates to LLM-based chatbots, producing inequitable service quality across dialect groups, a failure the authors term a quality-of-service harm. To measure this, the paper proposes a dialect-bias auditing framework tailored to LLM-based chatbots: it uses prompt engineering to dynamically generate dialectal test inputs and supports black-box audits grounded in realistic multi-turn interactions. Quality-of-service harm serves as the framework's quantifiable core metric, and because only query access is required, audits can be run by internal auditors, external auditors, and individual users alike. Response quality is evaluated along multiple dimensions, including semantic consistency, information completeness, and politeness, and robustness is tested via adversarial perturbations such as spelling-error injection. A case-study audit of Amazon Rufus shows significantly degraded response quality for prompts written in African American English, with typo perturbations further amplifying the disparity, demonstrating both the framework's validity and its practical urgency.
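To make the audit procedure concrete, below is a minimal Python sketch of the kind of loop the summary describes: dialectal test inputs are generated dynamically, a spelling-error perturbation is applied for robustness testing, and the chatbot is queried as a black box. The helper names (`rewrite_in_dialect`, `query_chatbot`, `score_response`) are illustrative placeholders, not functions from the paper.

```python
import random

def inject_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Robustness perturbation: randomly swap adjacent alphabetic characters."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def audit_prompt(base_prompt, dialects, rewrite_in_dialect, query_chatbot, score_response):
    """Query the chatbot (black box) with dialectal variants of one prompt and score each reply.

    Assumed helpers (illustrative, not from the paper):
      rewrite_in_dialect(text, dialect): e.g. an LLM call that rewrites the prompt
          in the target dialect (dynamic test-input generation).
      query_chatbot(prompt): sends one turn to the audited chatbot; only query
          access is assumed.
      score_response(prompt, reply): rates quality, e.g. combining semantic
          consistency, information completeness, and politeness.
    """
    results = {}
    for dialect in dialects:
        variant = rewrite_in_dialect(base_prompt, dialect)
        for label, text in [("clean", variant), ("typos", inject_typos(variant))]:
            reply = query_chatbot(text)
            results[(dialect, label)] = score_response(base_prompt, reply)
    return results
```

In practice, `score_response` could fold the evaluation dimensions named above (semantic consistency, information completeness, politeness) into a single quality score per reply.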

📝 Abstract
Increasingly, individuals who engage in online activities are expected to interact with large language model (LLM)-based chatbots. Prior work has shown that LLMs can display dialect bias, which occurs when they produce harmful responses when prompted with text written in minoritized dialects. However, whether and how this bias propagates to systems built on top of LLMs, such as chatbots, is still unclear. We conduct a review of existing approaches for auditing LLMs for dialect bias and show that they cannot be straightforwardly adapted to audit LLM-based chatbots due to issues of substantive and ecological validity. To address this, we present a framework for auditing LLM-based chatbots for dialect bias by measuring the extent to which they produce quality-of-service harms, which occur when systems do not work equally well for different people. Our framework has three key characteristics that make it useful in practice. First, by leveraging dynamically generated instead of pre-existing text, our framework enables testing over any dialect, facilitates multi-turn conversations, and represents how users are likely to interact with chatbots in the real world. Second, by measuring quality-of-service harms, our framework aligns audit results with the real-world outcomes of chatbot use. Third, our framework requires only query access to an LLM-based chatbot, meaning that it can be leveraged equally effectively by internal auditors, external auditors, and even individual users in order to promote accountability. To demonstrate the efficacy of our framework, we conduct a case study audit of Amazon Rufus, a widely-used LLM-based chatbot in the customer service domain. Our results reveal that Rufus produces lower-quality responses to prompts written in minoritized English dialects, and that these quality-of-service harms are exacerbated by the presence of typos in prompts.
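The abstract defines quality-of-service harms as systems not working equally well for different people. A simple, illustrative way to operationalize this (an assumption for exposition, not necessarily the paper's exact metric) is the gap in mean response quality between a reference dialect and each minoritized dialect:

```python
from statistics import mean

def quality_of_service_gap(scores_by_dialect: dict[str, list[float]],
                           reference: str = "Standard American English") -> dict[str, float]:
    """Mean response-quality gap of each dialect relative to a reference dialect.

    A positive gap means the chatbot serves that dialect worse than the
    reference; this is one plausible operationalization of a quality-of-service
    harm, not the paper's formula.
    """
    ref_quality = mean(scores_by_dialect[reference])
    return {dialect: ref_quality - mean(scores)
            for dialect, scores in scores_by_dialect.items()
            if dialect != reference}

# Hypothetical scores for illustration only (not results from the paper):
gaps = quality_of_service_gap({
    "Standard American English": [0.82, 0.79, 0.85],
    "African American English": [0.66, 0.70, 0.61],
})
# -> {"African American English": 0.163...}; a positive gap indicates degraded service.
```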
Problem

Research questions and friction points this paper is trying to address.

Auditing chatbots for dialect-based quality-of-service harms
Adapting bias detection frameworks for LLM-based chatbot systems
Measuring unequal response quality across minoritized English dialects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamically generates dialectal test text rather than relying on pre-existing corpora
Measures quality-of-service harms directly, aligning audit results with real-world outcomes
Requires only query access, so audits can be run by internal auditors, external auditors, or individual users (see the sketch after this list)
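Because only query access is assumed, the audit code needs nothing more than a send-a-message, get-a-reply interface. A minimal Python sketch of that access assumption follows; `QueryOnlyChatbot` and `run_multi_turn_audit` are illustrative names, not the paper's API.

```python
from typing import Protocol

class QueryOnlyChatbot(Protocol):
    """All the framework assumes about the system under audit: query access."""
    def send(self, message: str) -> str: ...

def run_multi_turn_audit(bot: QueryOnlyChatbot, turns: list[str]) -> list[str]:
    """Play a scripted multi-turn conversation and collect replies for later scoring."""
    return [bot.send(turn) for turn in turns]
```

Any implementation of `send`, whether an internal test harness or a user driving the public chat interface, can be plugged in, which is what lets internal auditors, external auditors, and individual users run the same audit.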