UXBench: Benchmarking User Experience in AI Assistants

📅 2026-06-08
🤖 AI Summary
This work proposes UXBench, described as the first user-centric benchmark grounded in real user feedback signals, to evaluate user experience (UX) in AI assistants beyond general model capability. Built from over 70,000 interaction logs of a mainstream Chinese AI assistant, UXBench comprises 7,400 test instances spanning 8 scenarios and 83 domains, organized into three interconnected tasks: UX Judge, UX Eval, and UX Recovery. Experiments on 26 frontier language models indicate that user feedback prediction is a learnable capability: a reward model trained on in-the-wild feedback signals can achieve well-calibrated accuracy. The study also documents systematic biases in LLM-as-a-judge evaluation protocols and examines how improvements in model capability contribute to dialogue engagement, offering a quantitative basis for user-oriented optimization of AI assistants.
📝 Abstract
As AI assistants serve millions of users daily, evaluating user experience (UX) beyond general model capability has become increasingly important. We present UXBench, the first user-centric benchmark grounded in real user feedback signals for evaluating preference alignment and dialogue generation. The benchmark consists of three interconnected tasks, UX Judge, UX Eval, and UX Recovery, with 7,400 test instances extracted from over 70K interaction logs of a mainstream Chinese AI assistant. The dataset closely reflects real user distributions, covering 8 scenarios, 83 domains, and diverse failure patterns that pose severe challenges. Extensive experiments on 26 frontier language models provide novel insights into how well models perceive user experience and how improvements in model capability contribute to better dialogue engagement. Through comprehensive analysis of model behavior and performance gaps, we show that user feedback prediction is a learnable capability, where a reward model trained from in-the-wild feedback signals can achieve well-calibrated accuracy. We further document the systematic biases of LLM-as-a-judge evaluation protocols and compare typical response strategies that directly affect user experience. UXBench establishes a new evaluation landscape and calls for greater attention to tailored UX optimization, contributing to a user-centric scaling law that shapes the success of AI assistants.
Problem

Research questions and friction points this paper is trying to address.

user experience
AI assistants
preference alignment
dialogue generation
user feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

UXBench
user experience
preference alignment
reward modeling
LLM-as-a-judge
Mengze Hong
Hong Kong Polytechnic University; Tencent

Xia Zeng
Tencent

Zeyang Lei
Tencent

Sheng Wang
Tencent

Chen Jason Zhang
Hong Kong Polytechnic University
Human-Centered Computing, AI for Science, AI for Hospitality Management

Di Jiang
Tencent

Taiming Fu
Tencent

Jinfeng Huang
Tencent

Mengqiao Liu
Tencent

Qinghe Chang
Tencent

Haosheng Zou
Tsinghua University
Reinforcement Learning

Qiongyi Zhou
Tencent

Sijun He
Tencent

Chen Xiaoshuai
Tencent

Simon Deng
Tencent

Haojing Huang
Tsinghua University
Natural Language Processing, Large Language Model

Zijian Li
Tencent

Lucas Mu Li
Tencent

Fubao Zhang
Tencent

Mona Zhou
Tencent

Wei Ma
Tencent

Chenxuan Ma
Tencent

Yuanmeng Zhang
Tencent

Jian Song
Tencent

Minlong Peng
Baidu
Machine Learning, Natural Language Processing