Evaluating Personalized Tool-Augmented LLMs from the Perspectives of Personalization and Proactivity

📅 2025-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM tool-use evaluation benchmarks neglect the joint assessment of personalization and proactivity. Method: We propose ETAPP—the first large-scale benchmark jointly evaluating both dimensions—comprising 800 diverse user-profile test cases and a sandboxed tool-execution environment for secure, controllable evaluation. We introduce a keypoint-guided “LLM-as-a-judge” paradigm to mitigate evaluator bias, design a preference-specification mechanism, and develop a tool-call strategy analysis framework to systematically characterize how fine-tuning affects personalization performance. Contribution/Results: Experiments validate our methodology and, for the first time, systematically reveal capability disparities among mainstream LLMs in personalized tool invocation. ETAPP provides empirical foundations and actionable optimization pathways for building adaptive, user-centered LLM agents.

📝 Abstract
Personalized tool utilization is essential for aligning large language models (LLMs) with user preferences in interaction scenarios involving various tools. However, most current benchmarks focus on either personalization of text generation or direct tool utilization, but not both. In this work, we introduce ETAPP, a novel benchmark for evaluating personalized tool invocation, which establishes a sandbox environment and a comprehensive dataset of 800 test cases covering diverse user profiles. To improve evaluation accuracy, we propose a key-point-based LLM evaluation method that mitigates biases in the LLM-as-a-judge system by manually annotating key points for each test case and providing them to the LLM as a reference. We further evaluate leading LLMs and provide an in-depth analysis, investigate how different tool-invoking strategies affect LLMs' personalization performance, and study the effects of fine-tuning on our task. We also validate the effectiveness of our preference-setting and key-point-based evaluation method. Our findings offer insights into improving personalized LLM agents. Our code is available at https://github.com/hypasd-art/ETAPP.
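The key-point-based evaluation described above can be illustrated with a minimal sketch. This is not the authors' code: the prompt format, function names, and the substring-matching stand-in for the judge LLM are all assumptions made for illustration. The idea is that each test case carries manually annotated key points, and the judge scores a response against those points rather than free-form, which anchors the LLM-as-a-judge and reduces scoring bias.

```python
# Hypothetical sketch of key-point-based LLM-as-a-judge evaluation.
# All names and formats here are illustrative assumptions, not ETAPP's API.

def build_judge_prompt(user_query, response, key_points):
    """Assemble a reference-guided judging prompt (illustrative format)."""
    points = "\n".join(f"- {p}" for p in key_points)
    return (
        f"User query:\n{user_query}\n\n"
        f"Model response:\n{response}\n\n"
        f"Reference key points the response should cover:\n{points}\n\n"
        "For each key point, answer yes or no, then give an overall score."
    )

def keypoint_coverage(response, key_points, judge=None):
    """Score a response as the fraction of annotated key points it satisfies.

    In the real pipeline `judge` would be an LLM call over the prompt above;
    here a case-insensitive substring check stands in so the sketch runs.
    """
    judge = judge or (lambda resp, point: point.lower() in resp.lower())
    hits = sum(bool(judge(response, p)) for p in key_points)
    return hits / len(key_points)

resp = ("I booked a window seat on the morning flight and applied "
        "your saved vegetarian meal preference.")
kps = ["window seat", "vegetarian meal", "loyalty number"]
score = keypoint_coverage(resp, kps)  # 2 of 3 key points covered
```

Swapping the substring check for an actual LLM call keeps the interface identical; the key design choice is that the judge receives the annotated key points as an explicit checklist instead of being asked for an unconstrained quality rating.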
Problem

Research questions and friction points this paper is trying to address.

Evaluates personalized tool-augmented LLMs for user alignment.
Introduces ETAPP benchmark for personalized tool invocation evaluation.
Proposes key-point-based method to mitigate LLM evaluation biases.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces ETAPP benchmark for personalized tool evaluation
Uses key-point-based method to reduce LLM evaluation bias
Analyzes tool-invoking strategies and fine-tuning impacts
Yupu Hao
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Pengfei Cao
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Zhuoran Jin
Institute of Automation, Chinese Academy of Sciences
Large Language Models · Natural Language Processing · Knowledge Engineering
Huanxuan Liao
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing · Large Language Model · Long Context Modeling
Yubo Chen
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing · Information Extraction · Event Extraction · Large Language Model
Kang Liu
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Jun Zhao
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China