MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness

📅 2026-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two limitations in using current large language models as simulated human users: they often produce verbose, inauthentic utterances, and there has been no systematic framework for assessing a user agent's "humanness." To bridge this gap, we propose the first task-agnostic paradigm for evaluating user-agent humanness, realized as an extensible and reproducible benchmark framework with a modular plug-in architecture and variance-aware evaluation mechanisms. Built upon a modular execution engine, typed interfaces, a metadata-driven registry, and multi-backend support, the framework integrates six complementary metrics: three lexical-diversity measures and three LLM-judge-based measures. Experiments across four public datasets reveal systematic discrepancies between existing user agents and real human users. We open-source the complete toolchain to facilitate configuration, execution, and report generation.
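This page does not show MirrorBench's actual interfaces, so the following is only a hypothetical sketch of what a metadata-driven metric registry with typed interfaces and pluggable metrics could look like; every name here (`MetricSpec`, `register`, `run_all`, `corpus_ttr`) is invented for illustration and is not the project's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class MetricSpec:
    """Hypothetical typed descriptor for one pluggable metric."""
    name: str
    kind: str                                  # e.g. "lexical" or "llm_judge"
    compute: Callable[[List[str]], float]      # utterances -> score
    meta: Dict[str, str] = field(default_factory=dict)

REGISTRY: Dict[str, MetricSpec] = {}

def register(spec: MetricSpec) -> None:
    """Add a metric to the registry, keyed by its unique name."""
    REGISTRY[spec.name] = spec

def corpus_ttr(utterances: List[str]) -> float:
    """Toy lexical metric: type-token ratio over the whole corpus."""
    tokens = " ".join(utterances).lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

register(MetricSpec(name="corpus_ttr", kind="lexical",
                    compute=corpus_ttr, meta={"family": "lexical-diversity"}))

def run_all(utterances: List[str]) -> Dict[str, float]:
    """Run every registered metric over a list of user-proxy utterances."""
    return {name: spec.compute(utterances) for name, spec in REGISTRY.items()}

print(run_all(["sure, that works", "ok thanks, that works for me"]))
```

A registry in this spirit is what lets new metrics be added as plug-ins without modifying the execution engine.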

📝 Abstract
Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine-tuning data. However, naive "act-as-a-user" prompting often yields verbose, unrealistic utterances, motivating principled evaluation of *user proxy agents*. We present **MirrorBench**, a reproducible and extensible benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances across diverse conversational regimes, explicitly decoupled from downstream task success. **MirrorBench** combines three lexical-diversity metrics (**MATTR**, **Yule's $K$**, and **HD-D**) with three LLM-judge-based metrics (**GTEval**, **Pairwise Indistinguishability**, and **Rubric-and-Reason**), and contextualizes judge scores using Human-Human and Proxy-Proxy calibration controls. Across four public datasets, **MirrorBench** yields variance-aware comparisons and reveals systematic gaps between user proxies and real human users. The framework is [open source](https://github.com/SAP/mirrorbench) and includes a command-line interface for running and managing user-proxy benchmarking experiments.
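The abstract names the lexical-diversity metrics but not their definitions. Below is a minimal, self-contained sketch of two of them under their standard definitions: MATTR (the mean type-token ratio over a fixed-size sliding window) and Yule's $K = 10^4 \,(\sum_i i^2 V_i - N)/N^2$, where $V_i$ counts word types occurring exactly $i$ times among $N$ tokens. This is an illustration, not MirrorBench's implementation, and the window size of 50 is an assumed default.

```python
from collections import Counter
from typing import List

def mattr(tokens: List[str], window: int = 50) -> float:
    """Moving-Average Type-Token Ratio: mean TTR over a sliding window,
    which reduces plain TTR's sensitivity to text length."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:              # text shorter than the window:
        return len(set(tokens)) / len(tokens)  # fall back to plain TTR
    ttrs = [len(set(tokens[i:i + window])) / window
            for i in range(len(tokens) - window + 1)]
    return sum(ttrs) / len(ttrs)

def yules_k(tokens: List[str]) -> float:
    """Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2, where V_i is the
    number of word types occurring exactly i times and N the token count.
    Higher K means more repetition, i.e. less lexical diversity."""
    n = len(tokens)
    if n == 0:
        return 0.0
    spectrum = Counter(Counter(tokens).values())   # i -> V_i
    s2 = sum(i * i * v for i, v in spectrum.items())
    return 10_000 * (s2 - n) / (n * n)

words = "yeah i guess that works yeah thanks that helps".split()
print(f"MATTR={mattr(words, window=5):.3f}  K={yules_k(words):.1f}")
```

HD-D, the third lexical metric, estimates the expected type-token ratio of a random sample via the hypergeometric distribution and is omitted here for brevity.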
Problem

Research questions and friction points this paper is trying to address.

user-proxy agents
human-likeness
conversational systems
evaluation framework
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

user-proxy agents
human-likeness evaluation
modular benchmarking framework
LLM-based metrics
extensible architecture
🔎 Similar Papers
No similar papers found.
Authors
Ashutosh Hathidara
SAP Labs
Julien Yu
SAP Labs
Vaishali Senthil
SAP Labs
Sebastian Schreiber
SAP Labs
Anil Babu Ankisettipalli
SAP Labs