🤖 AI Summary
Existing evaluation benchmarks for large language model–based customer service systems predominantly rely on static paradigms and single metrics, failing to capture both the diversity of real-world user behaviors and the strict adherence to standard operating procedures (SOPs) required in deployment. This work proposes a multi-agent dynamic evaluation framework that transforms unstructured SOPs into dynamic dialogue graphs and integrates user, service, and adjudicator agents with a rule engine to enable dual-axis automated assessment. The framework introduces an adversarial intent classification schema and a modular extensibility mechanism, enabling low-cost cross-domain deployment and automated dialogue generation. Evaluations of 27 models across six industrial scenarios reveal two key phenomena: an "execution gap" (high intent recognition accuracy but incorrect subsequent actions) and "empathy resilience" (maintaining polite surface behavior under high adversarial pressure), validating the framework's effectiveness.
📝 Abstract
The development of Large Language Models (LLMs) has catalyzed automation in customer service, yet benchmarking their performance remains challenging. Existing benchmarks predominantly rely on static paradigms and single-dimensional metrics, failing to account for diverse user behaviors or the strict adherence to structured Standard Operating Procedures (SOPs) required in real-world deployments. To bridge this gap, we propose SAGE (Service Agent Graph-guided Evaluation), a universal multi-agent benchmark for automated, dual-axis assessment. SAGE formalizes unstructured SOPs into Dynamic Dialogue Graphs, enabling precise verification of logical compliance and comprehensive path coverage. We introduce an Adversarial Intent Taxonomy and a modular Extension Mechanism, enabling low-cost deployment across domains and facilitating automated dialogue data synthesis. Evaluation is conducted via a framework in which Judge Agents and a Rule Engine analyze interactions between User and Service Agents to generate deterministic ground truth. Extensive experiments on 27 LLMs across 6 industrial scenarios reveal a significant "Execution Gap", where models accurately classify intents but fail to derive correct subsequent actions. We also observe "Empathy Resilience", a phenomenon where models maintain polite conversational facades despite underlying logical failures under high adversarial intensity. Code and resources are available at https://anonymous.4open.science/r/SAGE-Bench-4CD3/.
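The dual-axis idea described above (logical compliance against an SOP graph, plus path coverage across test dialogues) can be sketched minimally. The graph shape, node names, and functions below are illustrative assumptions for exposition, not the paper's actual data model or API:

```python
# Hypothetical SOP formalized as a dialogue graph: each node maps to the
# set of SOP-legal next steps (names are invented for illustration).
SOP_GRAPH = {
    "greet": {"verify_identity"},
    "verify_identity": {"classify_intent"},
    "classify_intent": {"process_refund", "escalate"},
    "process_refund": {"confirm", "escalate"},
    "escalate": {"confirm"},
    "confirm": set(),
}

def check_compliance(trace):
    """Axis 1: return every transition in a dialogue trace that violates the SOP."""
    return [
        (a, b) for a, b in zip(trace, trace[1:])
        if b not in SOP_GRAPH.get(a, set())
    ]

def path_coverage(traces):
    """Axis 2: fraction of SOP graph edges exercised across all test dialogues."""
    all_edges = {(a, b) for a, nexts in SOP_GRAPH.items() for b in nexts}
    seen = {(a, b) for t in traces for a, b in zip(t, t[1:])}
    return len(seen & all_edges) / len(all_edges)

good = ["greet", "verify_identity", "classify_intent", "process_refund", "confirm"]
bad = ["greet", "classify_intent", "process_refund"]  # skips identity verification

print(check_compliance(good))  # → []
print(check_compliance(bad))   # → [('greet', 'classify_intent')]
print(path_coverage([good, bad]))
```

A deterministic rule check like this is what lets an "Execution Gap" be measured separately from intent accuracy: a model may label the intent correctly (`classify_intent`) yet take a transition the SOP forbids.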