🤖 AI Summary
Evaluation of large language model (LLM)-based agents currently lacks a systematic, multimodal framework tailored to real-world e-commerce customer service scenarios.
Method: We introduce ECom-Bench—the first multimodal agent benchmark specifically designed for e-commerce customer service—constructed from millions of real-world dialogues. It features a user-profile-driven dynamic simulation mechanism and a high-difficulty composite task suite encompassing cross-modal understanding, multi-turn reasoning, and real-time decision-making.
Contribution/Results: ECom-Bench significantly enhances evaluation authenticity and challenge. Experiments reveal that even state-of-the-art multimodal models (e.g., GPT-4o) achieve only 10–20% on the pass^3 metric, exposing critical gaps in operational capability. This work shifts e-commerce AI agent evaluation from isolated skill assessment toward end-to-end problem-solving performance. The benchmark code and data will be publicly released.
📝 Abstract
In this paper, we introduce ECom-Bench, the first benchmark framework for evaluating LLM agents with multimodal capabilities in the e-commerce customer support domain. ECom-Bench features dynamic user simulation based on persona information collected from real e-commerce customer interactions, along with a realistic task dataset derived from authentic e-commerce dialogues. These tasks, covering a wide range of business scenarios, are designed to reflect real-world complexities, making ECom-Bench highly challenging. For instance, even advanced models like GPT-4o achieve only a 10–20% pass^3 score on our benchmark, highlighting the substantial difficulties posed by complex e-commerce scenarios. Upon publication, the code and data will be open-sourced to facilitate further research and development in this domain.
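For readers unfamiliar with the pass^k notation, a common way to compute it (introduced by τ-bench, which ECom-Bench's metric notation follows) is as an unbiased estimate of the probability that k independent attempts at a task all succeed. The sketch below assumes that definition; the function names and the example numbers are illustrative, not taken from the paper.

```python
from math import comb

def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate pass^k for one task: the probability that k i.i.d.
    attempts all succeed, given `successes` out of `trials` observed runs.
    Uses the combinatorial estimator C(c, k) / C(n, k)."""
    if trials < k:
        raise ValueError("need at least k trials per task")
    # math.comb(c, k) is 0 when c < k, so tasks with fewer than
    # k successes contribute 0, as expected.
    return comb(successes, k) / comb(trials, k)

def benchmark_pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass^k over all tasks; each entry is (successes, trials)."""
    return sum(pass_hat_k(c, n, k) for c, n in results) / len(results)

# Hypothetical example: 3 tasks, 4 trials each.
# Task 1 always succeeds, task 2 succeeds twice, task 3 never.
score = benchmark_pass_hat_k([(4, 4), (2, 4), (0, 4)], k=3)
print(f"pass^3 = {score:.3f}")
```

Because pass^k requires *every* one of k attempts to succeed, it penalizes inconsistent agents much more harshly than a single-attempt pass rate, which is part of why the reported 10–20% pass^3 scores are so low.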