Evaluating Personalized Tool-Augmented LLMs from the Perspectives of Personalization and Proactivity

📅 2025-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM tool-use evaluation benchmarks neglect the joint assessment of personalization and proactivity. Method: We propose ETAPP—the first large-scale benchmark jointly evaluating both dimensions—comprising 800 diverse user-profile test cases and a sandboxed tool-execution environment for secure, controllable evaluation. We introduce a keypoint-guided “LLM-as-a-judge” paradigm to mitigate evaluator bias, design a preference-specification mechanism, and develop a tool-call strategy analysis framework to systematically characterize how fine-tuning affects personalization performance. Contribution/Results: Experiments validate our methodology and, for the first time, systematically reveal capability disparities among mainstream LLMs in personalized tool invocation. ETAPP provides empirical foundations and actionable optimization pathways for building adaptive, user-centered LLM agents.

📝 Abstract
Personalized tool utilization is essential for aligning large language models (LLMs) with user preferences in interaction scenarios involving various tools. However, most current benchmarks focus on either personalization of text generation or direct tool utilization, but not both. In this work, we introduce ETAPP, a novel benchmark for evaluating personalized tool invocation, which establishes a sandbox environment and a comprehensive dataset of 800 test cases covering diverse user profiles. To improve evaluation accuracy, we propose a key-point-based LLM evaluation method that mitigates biases in the LLM-as-a-judge system by manually annotating key points for each test case and providing them to the LLM as a reference. We further evaluate leading LLMs and provide an in-depth analysis, investigate how different tool-invoking strategies affect LLMs' personalization performance, and study the effects of fine-tuning on our task. We also validate the effectiveness of our preference-setting and key-point-based evaluation method. Our findings offer insights into improving personalized LLM agents. Our code is available at https://github.com/hypasd-art/ETAPP.
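The key-point-based evaluation described above can be illustrated with a minimal sketch. This is not the authors' code: the prompt format, function names, and the substring-matching stand-in for the judge LLM are all assumptions made for illustration. The idea is that each test case carries manually annotated key points, and the judge scores a response against those points rather than free-form, which anchors the LLM-as-a-judge and reduces scoring bias.

```python
# Hypothetical sketch of key-point-based LLM-as-a-judge evaluation.
# All names and formats here are illustrative assumptions, not ETAPP's API.

def build_judge_prompt(user_query, response, key_points):
    """Assemble a reference-guided judging prompt (illustrative format)."""
    points = "\n".join(f"- {p}" for p in key_points)
    return (
        f"User query:\n{user_query}\n\n"
        f"Model response:\n{response}\n\n"
        f"Reference key points the response should cover:\n{points}\n\n"
        "For each key point, answer yes or no, then give an overall score."
    )

def keypoint_coverage(response, key_points, judge=None):
    """Score a response as the fraction of annotated key points it satisfies.

    In the real pipeline `judge` would be an LLM call over the prompt above;
    here a case-insensitive substring check stands in so the sketch runs.
    """
    judge = judge or (lambda resp, point: point.lower() in resp.lower())
    hits = sum(bool(judge(response, p)) for p in key_points)
    return hits / len(key_points)

resp = ("I booked a window seat on the morning flight and applied "
        "your saved vegetarian meal preference.")
kps = ["window seat", "vegetarian meal", "loyalty number"]
score = keypoint_coverage(resp, kps)  # 2 of 3 key points covered
```

Swapping the substring check for an actual LLM call keeps the interface identical; the key design choice is that the judge receives the annotated key points as an explicit checklist instead of being asked for an unconstrained quality rating.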
Problem

Research questions and friction points this paper is trying to address.

Evaluates personalized tool-augmented LLMs for user alignment.
Introduces ETAPP benchmark for personalized tool invocation evaluation.
Proposes key-point-based method to mitigate LLM evaluation biases.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces ETAPP benchmark for personalized tool evaluation
Uses key-point-based method to reduce LLM evaluation bias
Analyzes tool-invoking strategies and fine-tuning impacts
Yupu Hao
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Pengfei Cao
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Zhuoran Jin
Institute of Automation, Chinese Academy of Sciences
Large Language Models · Natural Language Processing · Knowledge Engineering
Huanxuan Liao
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing · Large Language Model · Long Context Modeling
Yubo Chen
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing · Information Extraction · Event Extraction · Large Language Model
Kang Liu
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Jun Zhao
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China