VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing agent evaluation methods rely on static benchmarks, which struggle to capture realistic failure modes in dynamic, multi-step interactions, and existing user-simulation frameworks offer limited means of assessing the quality and coverage of the interactions they generate. To address these limitations, this work proposes VISTA, a toolkit that combines a hybrid user simulator integrating both UI-based and API-based interactions with a suite of six metrics for measuring the realism, capability coverage, and interaction effectiveness of simulated interactions. Experiments in e-commerce shopping and education customer service settings indicate that VISTA produces more realistic and comprehensive evaluations than existing methods.

📝 Abstract

Evaluation remains a critical bottleneck for interactive agent development. Existing evaluation methods often rely on static benchmarks, which fail to capture the dynamic, multi-step nature of agentic behavior and struggle to expose meaningful failure modes. While user-simulation-based evaluation offers a promising alternative, existing simulation frameworks suffer from two major limitations. First, they provide limited mechanisms for evaluating the quality and comprehensiveness of simulated interactions, making it difficult to assess whether a simulator sufficiently explores an agent's capabilities and failure modes. Second, most frameworks are restricted to either UI-only actions or API-only actions, limiting their ability to model the full range of realistic user behaviors. To address these limitations, we propose VISTA, a Versatile Interactive user Simulation Toolkit for Agent evaluation. Our toolkit includes a suite of six metrics for measuring the realism, capability coverage, and interaction effectiveness of simulated interactions. In addition, we develop a hybrid user simulator that integrates both UI-based interactions and API-based interactions, enabling more realistic and comprehensive evaluation across diverse interactive environments. We evaluate VISTA in e-commerce shopping and education customer service settings and demonstrate that it produces more realistic and comprehensive evaluations than existing methods.
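
To make the idea of a hybrid simulator more concrete, the following is a minimal sketch of how a single simulated user trace might mix UI-level and API-level steps. It is not VISTA's actual interface: the names `UIAction`, `APIAction`, `channel_mix`, and `is_hybrid`, along with the example element and endpoint strings, are hypothetical and introduced only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class UIAction:
    """Hypothetical UI-level step, e.g. clicking or typing into an element."""
    kind: str        # illustrative values: "click", "type"
    target: str      # illustrative element identifier
    text: str = ""   # text entered, if any


@dataclass(frozen=True)
class APIAction:
    """Hypothetical API-level step, e.g. calling a backend endpoint."""
    endpoint: str
    params: tuple = ()  # (key, value) pairs


# A hybrid trace may contain either kind of step.
Action = Union[UIAction, APIAction]


def channel_mix(trace: list[Action]) -> dict[str, int]:
    """Count how many steps in a trace used each interaction channel."""
    counts = {"ui": 0, "api": 0}
    for action in trace:
        if isinstance(action, UIAction):
            counts["ui"] += 1
        elif isinstance(action, APIAction):
            counts["api"] += 1
        else:
            raise TypeError(f"unsupported action type: {type(action).__name__}")
    return counts


def is_hybrid(trace: list[Action]) -> bool:
    """A trace is hybrid if it uses both UI and API steps at least once."""
    mix = channel_mix(trace)
    return mix["ui"] > 0 and mix["api"] > 0


# Assumed example: a shopper searches through the UI, then checks an order via an API.
trace = [
    UIAction("click", "search-box"),
    UIAction("type", "search-box", "laptop"),
    APIAction("/orders/status", (("order_id", "123"),)),
]
print(channel_mix(trace), is_hybrid(trace))
```

A UI-only or API-only simulator would correspond to traces where one of the two counts is always zero, which is the restriction the abstract attributes to most existing frameworks.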

Problem

Research questions and friction points this paper is trying to address.

interactive agent evaluation

user simulation

static benchmarks

failure modes

realistic user behaviors

Innovation

Methods, ideas, or system contributions that make the work stand out.

user simulation

interactive agent evaluation

hybrid interaction