🤖 AI Summary
Existing evaluation benchmarks for large language model–based customer service systems predominantly rely on static paradigms and single metrics, failing to capture both the diversity of real-world user behaviors and the strict adherence to standard operating procedures (SOPs) required in deployment. This work proposes a multi-agent dynamic evaluation framework that transforms unstructured SOPs into dynamic dialogue graphs and integrates user, service, and adjudicator agents with a rule engine to enable dual-axis automated assessment. The framework introduces an adversarial intent classification schema and a modular extensibility mechanism, enabling low-cost cross-domain deployment and automated dialogue generation. Evaluations of 27 models across six industrial scenarios reveal two key phenomena: an "execution gap" (high intent recognition accuracy but incorrect subsequent actions) and "empathy resilience" (maintaining polite surface behavior under high adversarial pressure), validating the framework's effectiveness.
📝 Abstract
The development of Large Language Models (LLMs) has catalyzed automation in customer service, yet benchmarking their performance remains challenging. Existing benchmarks predominantly rely on static paradigms and single-dimensional metrics, failing to account for diverse user behaviors or the strict adherence to structured Standard Operating Procedures (SOPs) required in real-world deployments. To bridge this gap, we propose SAGE (Service Agent Graph-guided Evaluation), a universal multi-agent benchmark for automated, dual-axis assessment. SAGE formalizes unstructured SOPs into Dynamic Dialogue Graphs, enabling precise verification of logical compliance and comprehensive path coverage. We introduce an Adversarial Intent Taxonomy and a modular Extension Mechanism, enabling low-cost deployment across domains and facilitating automated dialogue data synthesis. Evaluation is conducted via a framework in which Judge Agents and a Rule Engine analyze interactions between User and Service Agents to generate deterministic ground truth. Extensive experiments on 27 LLMs across 6 industrial scenarios reveal a significant "Execution Gap", where models accurately classify intents but fail to derive correct subsequent actions. We also observe "Empathy Resilience", a phenomenon where models maintain polite conversational facades despite underlying logical failures under high adversarial intensity. Code and resources are available at https://anonymous.4open.science/r/SAGE-Bench-4CD3/.
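The dual-axis idea described above (logical compliance against an SOP graph, plus path coverage across test dialogues) can be sketched minimally. The graph shape, node names, and functions below are illustrative assumptions for exposition, not the paper's actual data model or API:

```python
# Hypothetical SOP formalized as a dialogue graph: each node maps to the
# set of SOP-legal next steps (names are invented for illustration).
SOP_GRAPH = {
    "greet": {"verify_identity"},
    "verify_identity": {"classify_intent"},
    "classify_intent": {"process_refund", "escalate"},
    "process_refund": {"confirm", "escalate"},
    "escalate": {"confirm"},
    "confirm": set(),
}

def check_compliance(trace):
    """Axis 1: return every transition in a dialogue trace that violates the SOP."""
    return [
        (a, b) for a, b in zip(trace, trace[1:])
        if b not in SOP_GRAPH.get(a, set())
    ]

def path_coverage(traces):
    """Axis 2: fraction of SOP graph edges exercised across all test dialogues."""
    all_edges = {(a, b) for a, nexts in SOP_GRAPH.items() for b in nexts}
    seen = {(a, b) for t in traces for a, b in zip(t, t[1:])}
    return len(seen & all_edges) / len(all_edges)

good = ["greet", "verify_identity", "classify_intent", "process_refund", "confirm"]
bad = ["greet", "classify_intent", "process_refund"]  # skips identity verification

print(check_compliance(good))  # → []
print(check_compliance(bad))   # → [('greet', 'classify_intent')]
print(path_coverage([good, bad]))
```

A deterministic rule check like this is what lets an "Execution Gap" be measured separately from intent accuracy: a model may label the intent correctly (`classify_intent`) yet take a transition the SOP forbids.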