Counterfactual Fairness Evaluation of LLM-Based Contact Center Agent Quality Assurance System

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses fairness concerns in large language models (LLMs) applied to customer service quality assurance, where biases related to agent identity, contextual cues, and behavioral styles—introduced during training—may lead to unfair evaluations. The authors construct the first counterfactual evaluation benchmark spanning 13 bias dimensions and systematically assess 18 LLMs on 3,000 real-world customer service dialogues. They propose a quantitative fairness metric combining Counterfactual Flip Rate (CFR) and Mean Absolute Score Difference (MASD), and conduct intervention analyses using fairness-aware prompts. Results reveal CFRs ranging from 5.4% to 13.0%, with contextual history inducing the most severe bias (CFR up to 16.4%). Explicit fairness prompting yields only marginal improvements, highlighting the critical influence of model scale and alignment on fairness outcomes.

📝 Abstract
Large Language Models (LLMs) are increasingly deployed in contact-center Quality Assurance (QA) to automate agent performance evaluation and coaching feedback. While LLMs offer unprecedented scalability and speed, their reliance on web-scale training data raises concerns regarding demographic and behavioral biases that may distort workforce assessment. We present a counterfactual fairness evaluation of LLM-based QA systems across 13 dimensions spanning three categories: Identity, Context, and Behavioral Style. Fairness is quantified using the Counterfactual Flip Rate (CFR), the frequency of binary judgment reversals, and the Mean Absolute Score Difference (MASD), the average shift in coaching or confidence scores across counterfactual pairs. Evaluating 18 LLMs on 3,000 real-world contact center transcripts, we find systematic disparities, with CFR ranging from 5.4% to 13.0% and consistent MASD shifts across confidence, positive, and improvement scores. Larger, more strongly aligned models show lower unfairness, though fairness does not track accuracy. Contextual priming of historical performance induces the most severe degradations (CFR up to 16.4%), while implicit linguistic identity cues remain a persistent bias source. Finally, we analyze the efficacy of fairness-aware prompting, finding that explicit instructions yield only modest improvements in evaluative consistency. Our findings underscore the need for standardized fairness auditing pipelines prior to deploying LLMs in high-stakes workforce evaluation.
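The two fairness metrics defined above can be made concrete with a short sketch. This is an illustrative implementation based only on the abstract's definitions, not the authors' released code; the function names and example values are hypothetical.

```python
# Sketch of the paper's two fairness metrics, as defined in the abstract.
# CFR: fraction of binary judgments that reverse under a counterfactual edit.
# MASD: mean absolute shift in scores across counterfactual pairs.

def cfr(original_judgments, counterfactual_judgments):
    """Counterfactual Flip Rate over paired binary judgments (0/1)."""
    pairs = list(zip(original_judgments, counterfactual_judgments))
    return sum(a != b for a, b in pairs) / len(pairs)

def masd(original_scores, counterfactual_scores):
    """Mean Absolute Score Difference over paired numeric scores."""
    pairs = list(zip(original_scores, counterfactual_scores))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

# Hypothetical example: 10 transcripts judged before/after swapping an
# identity cue (e.g., agent name) in the dialogue.
orig_pass = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
cf_pass   = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]
print(cfr(orig_pass, cf_pass))   # two flips out of ten -> 0.2

orig_conf = [4.0, 3.5, 2.0, 5.0]
cf_conf   = [3.5, 3.5, 2.5, 4.0]
print(masd(orig_conf, cf_conf))  # (0.5 + 0.0 + 0.5 + 1.0) / 4 -> 0.5
```

A CFR of 0 and MASD of 0 would indicate a perfectly counterfactually consistent evaluator; the paper reports CFRs of 5.4%–13.0% across the 18 models tested.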
Problem

Research questions and friction points this paper is trying to address.

Counterfactual Fairness
LLM Bias
Quality Assurance
Workforce Evaluation
Demographic Bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

Counterfactual Fairness
Large Language Models
Quality Assurance
Bias Evaluation
Contact Center
Kawin Mayilvaghanan
Observe.AI, Bangalore, India
Siddhant Gupta
Observe.AI, Bangalore, India
Ayush Kumar