🤖 AI Summary
Existing LLM evaluation methods suffer from three critical limitations in expert-level intelligent outbound call scenarios: narrow data coverage, unrealistic user simulation, and one-sided evaluation metrics. This paper introduces the first domain-specific benchmark for this setting, spanning six industries and 30 sub-scenarios. We propose a structured process-decomposition modeling framework and a dual-dimensional, domain-adaptive evaluation system that assesses both task completion and interaction quality, augmented by a domain-aligned weighted scoring mechanism and a dynamic human-AI collaborative evaluation framework. We further develop an LLM-driven virtual user simulator endowed with personality traits and emotion evolution, enabling high-fidelity interaction modeling. Empirical evaluation across twelve mainstream LLMs reveals a significant trade-off between task completion and interaction fluency. This work establishes a reproducible, scalable, and standardized evaluation paradigm, grounded in empirical results, for professional outbound-call AI systems.
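To make the process-decomposition and weighted-scoring idea concrete, here is a minimal sketch. The two-dimension split (task completion vs. interaction quality) follows the summary above, but the step names, weights, and the linear combination rule are illustrative assumptions, not the paper's actual formulas.

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    """Outcome of one decomposed process step in a sub-scenario (illustrative)."""
    name: str
    weight: float   # domain-aligned weight; assumed to sum to 1.0 per scenario
    score: float    # 0.0-1.0 task-completion score for this step

def scenario_score(steps: list[StepResult],
                   interaction_quality: float,
                   completion_weight: float = 0.6) -> float:
    """Combine per-step task completion with an interaction-quality score.

    The 0.6/0.4 default and the linear mix are assumptions for illustration.
    """
    task_completion = sum(s.weight * s.score for s in steps)
    return completion_weight * task_completion + (1 - completion_weight) * interaction_quality

# Example: a hypothetical insurance-renewal outbound call decomposed into three steps
steps = [
    StepResult("identity_verification", weight=0.2, score=1.0),
    StepResult("policy_explanation",    weight=0.5, score=0.8),
    StepResult("renewal_confirmation",  weight=0.3, score=0.6),
]
print(scenario_score(steps, interaction_quality=0.9))  # 0.6*0.78 + 0.4*0.9 = 0.828
```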
📝 Abstract
We propose OutboundEval, a comprehensive benchmark for evaluating large language models (LLMs) in expert-level intelligent outbound calling scenarios. Existing methods suffer from three key limitations: insufficient dataset diversity and category coverage, unrealistic user simulation, and inaccurate evaluation metrics. OutboundEval addresses these issues through a structured framework. First, we design a benchmark spanning six major business domains and 30 representative sub-scenarios, each with scenario-specific process decomposition, weighted scoring, and domain-adaptive metrics. Second, we develop a large-model-driven User Simulator that generates diverse, persona-rich virtual users with realistic behaviors, emotional variability, and communication styles, providing a controlled yet authentic testing environment. Third, we introduce a dynamic evaluation method that adapts to task variations, integrating automated and human-in-the-loop assessment to measure task execution accuracy, professional knowledge application, adaptability, and user experience quality. Experiments on 12 state-of-the-art LLMs reveal distinct trade-offs between expert-level task completion and interaction fluency, offering practical insights for building reliable, human-like outbound AI systems. OutboundEval establishes a practical, extensible, and domain-oriented standard for benchmarking LLMs in professional applications.
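The abstract's persona-rich User Simulator can be pictured as an LLM prompted with a persona whose emotional state evolves turn by turn. The sketch below illustrates that idea only; the persona fields, emotion labels, and prompt wording are hypothetical and are not OutboundEval's actual simulator interface.

```python
import random

# Minimal sketch of an LLM-driven virtual-user simulator: a persona with
# personality traits and an emotion state that evolves across turns.
PERSONAS = [
    {"name": "impatient commuter", "traits": ["curt", "time-pressured"], "emotion": "neutral"},
    {"name": "cautious retiree",   "traits": ["polite", "risk-averse"],  "emotion": "wary"},
]

# (current emotion, did the agent address the user's concern?) -> next emotion
EMOTION_SHIFTS = {
    ("neutral", True): "engaged",  ("neutral", False): "annoyed",
    ("wary", True): "neutral",     ("wary", False): "distrustful",
}

def build_user_prompt(persona: dict, agent_utterance: str) -> str:
    """Construct the system prompt fed to the simulator LLM on this turn."""
    return (
        f"You are a {persona['name']} ({', '.join(persona['traits'])}), "
        f"currently feeling {persona['emotion']}. Reply in character, in one or "
        f"two sentences, to the outbound agent who just said: \"{agent_utterance}\""
    )

def step_emotion(persona: dict, concern_addressed: bool) -> None:
    """Evolve the simulated user's emotion after each agent turn."""
    key = (persona["emotion"], concern_addressed)
    persona["emotion"] = EMOTION_SHIFTS.get(key, persona["emotion"])

persona = random.choice(PERSONAS)
print(build_user_prompt(persona, "Hello, I'm calling about your policy renewal."))
```

A dialogue loop built this way gives the evaluated agent a controlled yet varied counterpart: the persona fixes communication style, while the emotion transitions inject the behavioral variability the benchmark describes.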