MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness

📅 2026-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two limitations in using current large language models as simulated human users: they often produce verbose, inauthentic utterances, and there has been no systematic framework for assessing a user agent's "humanness." To bridge this gap, we propose the first task-agnostic paradigm for evaluating user-agent humanness, realized as an extensible and reproducible benchmark framework with a modular plug-in architecture and variance-aware evaluation mechanisms. Built upon a modular execution engine, typed interfaces, a metadata-driven registry, and multi-backend support, the framework integrates six complementary metrics: three lexical-diversity measures and three LLM-judge-based measures. Experiments across four public datasets reveal systematic discrepancies between existing user agents and real human users. We open-source the complete toolchain to facilitate configuration, execution, and report generation.
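This page does not show MirrorBench's actual interfaces, so the following is only a hypothetical sketch of what a metadata-driven metric registry with typed interfaces and pluggable metrics could look like; every name here (`MetricSpec`, `register`, `run_all`, `corpus_ttr`) is invented for illustration and is not the project's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class MetricSpec:
    """Hypothetical typed descriptor for one pluggable metric."""
    name: str
    kind: str                                  # e.g. "lexical" or "llm_judge"
    compute: Callable[[List[str]], float]      # utterances -> score
    meta: Dict[str, str] = field(default_factory=dict)

REGISTRY: Dict[str, MetricSpec] = {}

def register(spec: MetricSpec) -> None:
    """Add a metric to the registry, keyed by its unique name."""
    REGISTRY[spec.name] = spec

def corpus_ttr(utterances: List[str]) -> float:
    """Toy lexical metric: type-token ratio over the whole corpus."""
    tokens = " ".join(utterances).lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

register(MetricSpec(name="corpus_ttr", kind="lexical",
                    compute=corpus_ttr, meta={"family": "lexical-diversity"}))

def run_all(utterances: List[str]) -> Dict[str, float]:
    """Run every registered metric over a list of user-proxy utterances."""
    return {name: spec.compute(utterances) for name, spec in REGISTRY.items()}

print(run_all(["sure, that works", "ok thanks, that works for me"]))
```

A registry in this spirit is what lets new metrics be added as plug-ins without modifying the execution engine.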

📝 Abstract
Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine-tuning data. However, naive "act-as-a-user" prompting often yields verbose, unrealistic utterances, motivating principled evaluation of *user proxy agents*. We present **MirrorBench**, a reproducible and extensible benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances across diverse conversational regimes, explicitly decoupled from downstream task success. **MirrorBench** combines three lexical-diversity metrics (**MATTR**, **Yule's $K$**, and **HD-D**) with three LLM-judge-based metrics (**GTEval**, **Pairwise Indistinguishability**, and **Rubric-and-Reason**), and contextualizes judge scores using Human-Human and Proxy-Proxy calibration controls. Across four public datasets, **MirrorBench** yields variance-aware comparisons and reveals systematic gaps between user proxies and real human users. The framework is [open source](https://github.com/SAP/mirrorbench) and includes a command-line interface for running and managing user-proxy benchmarking experiments.
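The abstract names the lexical-diversity metrics but not their definitions. Below is a minimal, self-contained sketch of two of them under their standard definitions: MATTR (the mean type-token ratio over a fixed-size sliding window) and Yule's $K = 10^4 \,(\sum_i i^2 V_i - N)/N^2$, where $V_i$ counts word types occurring exactly $i$ times among $N$ tokens. This is an illustration, not MirrorBench's implementation, and the window size of 50 is an assumed default.

```python
from collections import Counter
from typing import List

def mattr(tokens: List[str], window: int = 50) -> float:
    """Moving-Average Type-Token Ratio: mean TTR over a sliding window,
    which reduces plain TTR's sensitivity to text length."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:              # text shorter than the window:
        return len(set(tokens)) / len(tokens)  # fall back to plain TTR
    ttrs = [len(set(tokens[i:i + window])) / window
            for i in range(len(tokens) - window + 1)]
    return sum(ttrs) / len(ttrs)

def yules_k(tokens: List[str]) -> float:
    """Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2, where V_i is the
    number of word types occurring exactly i times and N the token count.
    Higher K means more repetition, i.e. less lexical diversity."""
    n = len(tokens)
    if n == 0:
        return 0.0
    spectrum = Counter(Counter(tokens).values())   # i -> V_i
    s2 = sum(i * i * v for i, v in spectrum.items())
    return 10_000 * (s2 - n) / (n * n)

words = "yeah i guess that works yeah thanks that helps".split()
print(f"MATTR={mattr(words, window=5):.3f}  K={yules_k(words):.1f}")
```

HD-D, the third lexical metric, estimates the expected type-token ratio of a random sample via the hypergeometric distribution and is omitted here for brevity.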
Problem

Research questions and friction points this paper is trying to address.

user-proxy agents
human-likeness
conversational systems
evaluation framework
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

user-proxy agents
human-likeness evaluation
modular benchmarking framework
LLM-based metrics
extensible architecture
🔎 Similar Papers
No similar papers found.
Authors
Ashutosh Hathidara
SAP Labs
Julien Yu
SAP Labs
Vaishali Senthil
SAP Labs
Sebastian Schreiber
SAP Labs
Anil Babu Ankisettipalli
SAP Labs