🤖 AI Summary
Existing agent evaluation methods often rely on static benchmarks, which struggle to capture realistic failure modes in multi-step, dynamic interactions. User-simulation-based alternatives offer limited means of assessing the quality and coverage of simulated interactions, and most are restricted to either UI-only or API-only actions. To address these limitations, this work proposes VISTA, a hybrid user simulation toolkit that integrates both UI and API interactions, and introduces a suite of six metrics for measuring interaction realism, capability coverage, and interaction effectiveness. Experiments in e-commerce shopping and education customer service scenarios show that VISTA produces more realistic and more comprehensive agent evaluations than existing methods.
📝 Abstract
Evaluation remains a critical bottleneck for interactive agent development. Existing evaluation methods often rely on static benchmarks, which fail to capture the dynamic, multi-step nature of agentic behavior and struggle to expose meaningful failure modes. While user-simulation-based evaluation offers a promising alternative, existing simulation frameworks suffer from two major limitations. First, they provide limited mechanisms for evaluating the quality and comprehensiveness of simulated interactions, making it difficult to assess whether a simulator sufficiently explores an agent's capabilities and failure modes. Second, most frameworks are restricted to either UI-only actions or API-only actions, limiting their ability to model the full range of realistic user behaviors. To address these limitations, we propose VISTA, a Versatile Interactive user Simulation Toolkit for Agent evaluation. Our toolkit includes a suite of six metrics for measuring the realism, capability coverage, and interaction effectiveness of simulated interactions. In addition, we develop a hybrid user simulator that integrates both UI-based interactions and API-based interactions, enabling more realistic and comprehensive evaluation across diverse interactive environments. We evaluate VISTA in e-commerce shopping and education customer service settings and demonstrate that it produces more realistic and comprehensive evaluations than existing methods.