SAGE: A Service Agent Graph-guided Evaluation Benchmark

📅 2026-04-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing evaluation benchmarks for large language model–based customer service systems rely predominantly on static paradigms and single-dimensional metrics, failing to capture either the diversity of real-world user behaviors or the strict adherence to standard operating procedures (SOPs) that deployments demand. This work proposes a multi-agent dynamic evaluation framework that transforms unstructured SOPs into dynamic dialogue graphs and combines user, service, and adjudicator agents with a rule engine to enable dual-axis automated assessment. The framework further introduces an adversarial intent taxonomy and a modular extension mechanism, enabling low-cost cross-domain deployment and automated dialogue generation. Evaluations of 27 models across six industrial scenarios reveal two key phenomena: an "execution gap" (high intent-recognition accuracy paired with incorrect subsequent actions) and "empathy resilience" (polite surface behavior maintained under high adversarial pressure), validating the framework's effectiveness.
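
For intuition, the framework's core idea — an SOP recast as a graph whose transitions are gated by classified user intents — can be pictured with a minimal sketch. The class names (SOPNode, DialogueGraph), the example intents, and the refund scenario below are illustrative assumptions, not the paper's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class SOPNode:
    """One step of a standard operating procedure, e.g. 'verify the order'."""
    node_id: str
    instruction: str

@dataclass
class DialogueGraph:
    """Directed graph over SOP steps; each edge is keyed by the user intent
    that licenses the transition from one step to the next."""
    nodes: dict = field(default_factory=dict)   # node_id -> SOPNode
    edges: dict = field(default_factory=dict)   # (node_id, intent) -> node_id

    def add_node(self, node: SOPNode) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, src: str, intent: str, dst: str) -> None:
        self.edges[(src, intent)] = dst

    def next_step(self, current: str, intent: str):
        """Return the SOP step the service agent should move to,
        or None if the intent is not a legal transition from here."""
        return self.edges.get((current, intent))

# Hypothetical refund SOP with one compliant and one adversarial branch.
graph = DialogueGraph()
graph.add_node(SOPNode("verify", "Ask for and verify the customer's order number."))
graph.add_node(SOPNode("check_policy", "Check refund eligibility against policy."))
graph.add_node(SOPNode("escalate", "Escalate the case to a human agent."))
graph.add_edge("verify", "provide_order_id", "check_policy")
graph.add_edge("verify", "demand_immediate_refund", "escalate")

assert graph.next_step("verify", "provide_order_id") == "check_policy"
```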

📝 Abstract
The development of Large Language Models (LLMs) has catalyzed automation in customer service, yet benchmarking their performance remains challenging. Existing benchmarks predominantly rely on static paradigms and single-dimensional metrics, failing to account for diverse user behaviors or the strict adherence to structured Standard Operating Procedures (SOPs) required in real-world deployments. To bridge this gap, we propose SAGE (Service Agent Graph-guided Evaluation), a universal multi-agent benchmark for automated, dual-axis assessment. SAGE formalizes unstructured SOPs into Dynamic Dialogue Graphs, enabling precise verification of logical compliance and comprehensive path coverage. We introduce an Adversarial Intent Taxonomy and a modular Extension Mechanism, enabling low-cost deployment across domains and facilitating automated dialogue data synthesis. Evaluation is conducted via a framework where Judge Agents and a Rule Engine analyze interactions between User and Service Agents to generate deterministic ground truth. Extensive experiments on 27 LLMs across 6 industrial scenarios reveal a significant "Execution Gap" where models accurately classify intents but fail to derive correct subsequent actions. We also observe "Empathy Resilience", a phenomenon where models maintain polite conversational facades despite underlying logical failures under high adversarial intensity. Code and resources are available at https://anonymous.4open.science/r/SAGE-Bench-4CD3/.
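
A rough picture of the dual-axis assessment described above: a deterministic rule engine replays the dialogue trace against the SOP transition table, while a judge model scores conversational quality. The function names, scoring scheme, and trace format here are assumptions for illustration, not the benchmark's actual implementation.

```python
from typing import Callable

# (current_step, classified_user_intent) -> next_step, derived from the SOP graph.
Transitions = dict[tuple[str, str], str]
# One dialogue turn: (classified user intent, step the service agent actually took).
Trace = list[tuple[str, str]]

def compliance_score(transitions: Transitions, entry: str, trace: Trace) -> float:
    """Deterministic axis: fraction of turns where the service agent moved to
    the step the SOP graph licenses for the classified intent."""
    current, correct = entry, 0
    for intent, step_taken in trace:
        expected = transitions.get((current, intent))
        if expected is not None and step_taken == expected:
            correct += 1
        # Follow the reference path so one slip does not cascade through the trace.
        current = expected if expected is not None else current
    return correct / max(len(trace), 1)

def dual_axis_eval(transitions: Transitions, entry: str, trace: Trace,
                   judge: Callable[[Trace], float]) -> dict[str, float]:
    """Combine the rule-engine axis with a judge-model axis (tone, empathy, etc.)."""
    return {
        "logical_compliance": compliance_score(transitions, entry, trace),
        "conversational_quality": judge(trace),
    }

# Usage with a stand-in judge that returns a fixed quality score.
transitions = {("verify", "provide_order_id"): "check_policy",
               ("verify", "demand_immediate_refund"): "escalate"}
trace = [("provide_order_id", "check_policy"),
         ("demand_immediate_refund", "check_policy")]
print(dual_axis_eval(transitions, "verify", trace, judge=lambda t: 0.9))
# -> {'logical_compliance': 0.5, 'conversational_quality': 0.9}
```
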
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Customer Service Automation
Benchmarking
Standard Operating Procedures
Evaluation Metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Dialogue Graphs
Adversarial Intent Taxonomy
Multi-agent Benchmark
Rule Engine
Execution Gap
🔎 Similar Papers
No similar papers found.
Ling Shi
Tianjin University
NLP, LLM

Yuqin Dai
Tsinghua University
LLM, AI4Science, Avatar, Generative Model

Ziyin Wang
Tianjin University

Ning Gao
Beihang University

Wei Zhang
Beijing University of Posts and Telecommunications

Chaozheng Wang
The Chinese University of Hong Kong
software engineering, artificial intelligence

Yujie Wang
Independent Researcher

Wei He
Independent Researcher

Jinpeng Wang
Independent Researcher

Deyi Xiong
Professor, College of Intelligence and Computing, Tianjin University, China
Natural Language Processing, Large Language Models, AI4Science