🤖 AI Summary
This paper identifies and systematically investigates "operational bias", a class of model-induced distortion in large language model (LLM)-generated summaries of contact center dialogues that is distinct from social or positional biases and is rooted in domain-specific business logic and workflow constraints. To address it, the authors propose BlindSpot: (1) the first fine-grained taxonomy of operational bias, comprising 15 dimensions; (2) a zero-shot LLM-based classifier that, without human annotation, infers categorical label distributions over both raw calls and their summaries along every dimension; and (3) two quantitative metrics, Fidelity Gap (the Jensen–Shannon divergence between the two distributions) and Coverage (the share of source labels omitted from the summary), to quantify bias magnitude. Experiments on 2,500 real-world customer service calls and 20 state-of-the-art LLMs show that every model exhibits significant and consistent operational bias, unmitigated by model scale or architecture family.
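For intuition, the zero-shot labeling step can be sketched as follows. This is a minimal illustration under assumptions of mine, not the paper's implementation: the prompt wording, the `gpt-4o-mini` model choice, the per-utterance classification granularity, and the two-label "speaker" dimension are all hypothetical.

```python
# Hypothetical sketch of BlindSpot's zero-shot labeling step (not the paper's code).
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SPEAKER_LABELS = ["agent", "customer"]  # one of the 15 dimensions in the taxonomy


def classify_utterances(utterances: list[str], labels: list[str]) -> Counter:
    """Assign one label per utterance via a zero-shot prompt; return label counts."""
    counts: Counter = Counter()
    for utt in utterances:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: any capable chat model would do
            temperature=0,
            messages=[
                {"role": "system",
                 "content": f"Classify the utterance into exactly one of: "
                            f"{', '.join(labels)}. Reply with the label only."},
                {"role": "user", "content": utt},
            ],
        )
        label = resp.choices[0].message.content.strip().lower()
        if label in labels:  # ignore malformed replies in this sketch
            counts[label] += 1
    return counts
```

Running this once over the raw transcript and once over the summary yields the pair of categorical distributions that the metrics below compare.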
📝 Abstract
Abstractive summarization is a core application in contact centers, where Large Language Models (LLMs) generate millions of summaries of call transcripts daily. Despite their apparent quality, it remains unclear whether LLMs systematically under- or over-attend to specific aspects of the transcript, potentially introducing biases in the generated summary. While prior work has examined social and positional biases, the specific forms of bias pertinent to contact center operations, which we term Operational Bias, have remained unexplored. To address this gap, we introduce BlindSpot, a framework built upon a taxonomy of 15 operational bias dimensions (e.g., disfluency, speaker, topic) for the identification and quantification of these biases. BlindSpot leverages an LLM as a zero-shot classifier to derive categorical distributions for each bias dimension over a transcript and its summary. Bias is then quantified using two metrics: Fidelity Gap (the Jensen–Shannon divergence between the two distributions) and Coverage (the percentage of source labels omitted from the summary). Using BlindSpot, we conducted an empirical study with 2,500 real call transcripts and their summaries generated by 20 LLMs of varying scales and families (e.g., GPT, Llama, Claude). Our analysis reveals that biases are systemic and present across all evaluated models, regardless of size or family.
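Both metrics are straightforward to compute once each dimension yields label counts for the transcript and its summary. The following is a minimal sketch under that assumption; the function names, the base-2 Jensen–Shannon divergence, and the uniform fallback for empty counts are choices of mine, not details from the paper.

```python
# Minimal sketch of the two BlindSpot metrics (illustrative, not the paper's code).
from collections import Counter

import numpy as np


def _to_probs(counts: Counter, labels: list[str]) -> np.ndarray:
    """Normalize label counts to a categorical distribution over `labels`."""
    vec = np.array([counts.get(l, 0) for l in labels], dtype=float)
    total = vec.sum()
    # Assumption: fall back to uniform if no labels were assigned at all.
    return vec / total if total > 0 else np.full(len(labels), 1.0 / len(labels))


def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon divergence with base-2 logs, bounded in [0, 1]."""
    m = 0.5 * (p + q)

    def kl(a: np.ndarray, b: np.ndarray) -> float:
        mask = a > 0  # 0 * log(0) terms contribute nothing
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def fidelity_gap(source: Counter, summary: Counter) -> float:
    """JS divergence between the source and summary label distributions."""
    labels = sorted(set(source) | set(summary))
    return js_divergence(_to_probs(source, labels), _to_probs(summary, labels))


def coverage(source: Counter, summary: Counter) -> float:
    """Per the abstract's phrasing: percentage of source labels omitted from
    the summary (0 = everything preserved, 100 = everything lost)."""
    src_labels = {l for l, c in source.items() if c > 0}
    if not src_labels:
        return 0.0
    omitted = [l for l in src_labels if summary.get(l, 0) == 0]
    return 100.0 * len(omitted) / len(src_labels)


# Toy example for one dimension ("speaker"): the summary over-represents the agent.
src = Counter({"agent": 60, "customer": 40})
summ = Counter({"agent": 9, "customer": 1})
print(f"Fidelity Gap: {fidelity_gap(src, summ):.3f}")  # > 0: distributional drift
print(f"Coverage:     {coverage(src, summ):.1f}%")     # 0.0%: no label fully omitted
```

Note how the two metrics are complementary: the toy example above has a nonzero Fidelity Gap (the summary skews toward the agent) but perfect Coverage (no source label disappears entirely).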