🤖 AI Summary
This study investigates whether large language models (LLMs) can reliably substitute for real human users in agent evaluation, focusing on validity and fairness across culturally and linguistically diverse contexts. Through a large-scale user study conducted in the United States, India, Kenya, and Nigeria, combined with fine-grained analyses of multiple LLM-based simulated users, task-difficulty stratification, and dialectal variation (e.g., African American Vernacular English vs. Standard American English), the work reveals significant calibration biases in LLM simulations and underrepresentation of multilingual and multidialectal populations. The findings show that the choice of simulated user shifts agent success rates by up to 9 percentage points, that simulations systematically underestimate performance on high-difficulty tasks while overestimating it on moderately difficult ones, and that they disproportionately misjudge AAVE speakers, with the disparity widening with age. Together, these results challenge the validity of current mainstream evaluation paradigms.
📝 Abstract
Agentic benchmarks increasingly rely on LLM-simulated users to scalably evaluate agent performance, yet the robustness, validity, and fairness of this approach remain unexamined. Through a user study with participants across the United States, India, Kenya, and Nigeria, we investigate whether LLM-simulated users serve as reliable proxies for real human users in evaluating agents on τ-Bench retail tasks. We find that user simulation lacks robustness, with agent success rates varying up to 9 percentage points across different user LLMs. Furthermore, evaluations using simulated users exhibit systematic miscalibration, underestimating agent performance on challenging tasks and overestimating it on moderately difficult ones. African American Vernacular English (AAVE) speakers experience consistently worse success rates and calibration errors than Standard American English (SAE) speakers, with disparities compounding significantly with age. We also find simulated users to be a differentially effective proxy for different populations, performing worst for AAVE and Indian English speakers. Additionally, simulated users introduce conversational artifacts and surface different failure patterns than human users. These findings demonstrate that current evaluation practices risk misrepresenting agent capabilities across diverse user populations and may obscure real-world deployment challenges.
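To make the two headline metrics concrete, below is a minimal sketch of how one might compute the success-rate spread across simulated-user LLMs and the calibration gap relative to real human users. The simulator names and per-task outcomes are invented for illustration; this is not the paper's actual evaluation pipeline.

```python
import statistics

# Hypothetical per-task outcomes: 1 if the agent completed the task when
# paired with that simulated user, 0 otherwise (values are made up).
results_by_simulator = {
    "user-llm-a": [1, 0, 1, 1, 0, 1, 1, 0],
    "user-llm-b": [1, 1, 1, 0, 0, 1, 0, 0],
    "user-llm-c": [0, 0, 1, 1, 1, 1, 1, 1],
}
# Outcomes for the same tasks when evaluated with real human users.
human_results = [1, 0, 1, 1, 1, 1, 0, 0]

def success_rate(outcomes):
    """Fraction of tasks the agent completed successfully."""
    return sum(outcomes) / len(outcomes)

# Robustness: how much the measured success rate depends on which
# simulated-user LLM is used.
rates = {name: success_rate(r) for name, r in results_by_simulator.items()}
spread_pp = (max(rates.values()) - min(rates.values())) * 100
print(f"Success-rate spread across simulators: {spread_pp:.1f} pp")

# Calibration: signed gap between simulated and human-based evaluation.
human_rate = success_rate(human_results)
for name, rate in sorted(rates.items()):
    gap_pp = (rate - human_rate) * 100
    print(f"{name}: {rate:.2f} (gap vs. human users: {gap_pp:+.1f} pp)")
```

In this framing, the paper's finding of up to a 9-percentage-point variation corresponds to the spread value, and the systematic over- and underestimation by task difficulty would show up as consistently signed gaps when results are stratified by difficulty or by speaker dialect.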