MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

📅 2026-06-01
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing benchmarks for large language models (LLMs) focus primarily on general-purpose information tools and do not adequately assess model performance in realistic social and collaborative scenarios that involve personal accounts or local databases. To address this gap, this work introduces MCP-Persona, which the authors describe as the first benchmark tailored to personalized Model Context Protocol (MCP) tools. The framework simulates the APIs and local environments of real-world platforms such as Reddit, Xiaohongshu, Feishu, and Slack, enabling reproducible tool-interaction testing. Experiments show that current mainstream LLM agents struggle significantly when invoking personalized tools, which supports the need for a benchmark that evaluates LLM capabilities in authentic application contexts.
📝 Abstract
The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data sources and tools, and has been rapidly adopted across personal applications and development platforms. However, existing benchmarks predominantly focus on generic information-seeking tools and fail to capture the practical challenges posed by personal social applications, where tools interact with individual accounts or local databases. To bridge this critical gap, we introduce MCP-Persona, the first benchmark specifically designed for evaluating agent performance on real-world, personalized MCP tools. MCP-Persona encompasses a diverse set of widely-used applications, ranging from social media platforms like Reddit and Xiaohongshu (Rednote) to enterprise collaboration suites such as Lark (Feishu) and Slack. Our extensive experiments on various state-of-the-art (SOTA) agents demonstrate their significant struggles with personalized tool use, thereby highlighting the benchmark's crucial role in identifying and addressing these limitations. MCP-Persona is publicly available at https://github.com/wwh0411/MCP-Persona.
Problem

Research questions and friction points this paper is trying to address.

MCP
LLM agents
personal applications
benchmarking
tool use
Innovation

Methods, ideas, or system contributions that make the work stand out.

MCP-Persona
LLM agents
personalized applications
environment simulation
tool-use benchmarking
Wenhao Wang
Zhejiang University
Large Language Model, GUI Agent, Federated Learning, Synthetic Data
Peizhi Niu
Multi-Agent Governance & Intelligence Crew (MAGIC), Shanghai Jiao Tong University, Shanghai, China; University of Illinois at Urbana-Champaign, Illinois, USA
Gongyi Zou
Multi-Agent Governance & Intelligence Crew (MAGIC), Shanghai Jiao Tong University, Shanghai, China; University of Oxford, Oxford, UK
Xiyuan Yang
UIUC
Trustworthy Machine Learning
Jingxing Wang
Multi-Agent Governance & Intelligence Crew (MAGIC), Shanghai Jiao Tong University, Shanghai, China
Haoting Shi
Multi-Agent Governance & Intelligence Crew (MAGIC), Shanghai Jiao Tong University, Shanghai, China
Yaxin Du
Shanghai Jiao Tong University
federated learning, LLM agents
Jingyi Chai
Shanghai Jiao Tong University
Large Language Model, Federated Learning
Xianghe Pang
Shanghai Jiao Tong University
LLM Agent
Shuo Tang
Shanghai Jiao Tong University
Yanfeng Wang
Shanghai Jiao Tong University
Siheng Chen
Shanghai Jiao Tong University
Collective intelligence, LLM agent, graph signal processing, collaborative perception