ACEBench: Who Wins the Match Point in Tool Learning?

📅 2025-01-22
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
Existing benchmarks for evaluating LLM tool use suffer from narrow scenario coverage, limited assessment dimensions, and low efficiency, and they fail to reflect real-world complexities such as multi-turn dialogue, ambiguous instructions, and multi-agent interaction. To address these limitations, we propose the first comprehensive benchmark for evaluating LLMs' tool-calling capabilities, built around a novel three-tiered evaluation paradigm: Normal, Special, and Agent. We design a lightweight result-verification mechanism combining rule-based and symbolic validation, a multi-turn state-modeling method, and a multi-agent collaborative simulation framework, and we develop a structured toolchain for annotating and analyzing function-call trajectories. Evaluating 12 mainstream LLMs with this system achieves 92% error-attribution accuracy and a 47× improvement in evaluation throughput, and it uncovers critical weaknesses in LLMs' understanding of ambiguous instructions and in collaborative execution.
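The "rule-based and symbolic validation" mentioned above amounts to checking a model's emitted function call against a gold annotation, with no live API execution and no LLM judge, which is where the efficiency gain comes from. Below is a minimal sketch of what such a check could look like; the `name`/`arguments` schema, the `normalize` helper, and the example tool are illustrative assumptions, not ACEBench's actual implementation.

```python
from typing import Any

def normalize(value: Any) -> Any:
    """Canonicalize argument values so trivially equivalent forms compare equal."""
    if isinstance(value, str):
        return value.strip().lower()
    if isinstance(value, list):
        return [normalize(v) for v in value]
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items()}
    return value

def call_matches(predicted: dict, gold: dict) -> bool:
    """Rule-based check: the function name and all arguments must match the
    gold annotation after normalization; no real API is invoked."""
    if predicted.get("name") != gold.get("name"):
        return False
    pred_args = normalize(predicted.get("arguments", {}))
    gold_args = normalize(gold.get("arguments", {}))
    return pred_args == gold_args

# Example: a model's predicted call checked against the annotated gold call.
predicted = {"name": "get_weather", "arguments": {"city": "Paris ", "unit": "Celsius"}}
gold = {"name": "get_weather", "arguments": {"city": "paris", "unit": "celsius"}}
print(call_matches(predicted, gold))  # True: matches after normalization
```

Because every check is a deterministic comparison, large batches of samples can be scored without network calls or model inference, which is consistent with the throughput improvement the summary reports.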

📝 Abstract
Large language models (LLMs) have demonstrated significant potential in decision-making and reasoning, especially when combined with various tools to effectively solve complex problems. However, existing evaluation systems for assessing LLM function-calling capabilities have several limitations: (1) limited evaluation scenarios, lacking assessment in real multi-turn dialogue contexts; (2) narrow evaluation dimensions, lacking detailed assessment of fine-grained function calls; (3) reliance on LLMs or real API executions for result evaluation, which introduces significant overhead. To address these issues, we propose a comprehensive evaluation system named ACEBench, meticulously designed to encompass a wide spectrum of function-calling scenarios. It categorizes these scenarios into three primary types according to the evaluation methodology: Normal, Special, and Agent. Normal evaluates function calls in basic scenarios; Special evaluates function calls in scenarios with vague or incomplete instructions; Agent introduces multi-agent interactions to simulate function-calling evaluation in real-world multi-turn interactions. We conducted extensive experiments on ACEBench, analyzing various LLMs in depth and performing a more granular analysis of error causes across different data types.
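To make the Normal/Special/Agent split concrete, here is a hedged sketch of how samples from the three tiers might be routed to different scoring strategies. The sample schema and evaluator names are hypothetical, chosen only to illustrate the taxonomy the abstract describes.

```python
from enum import Enum

class Scenario(Enum):
    NORMAL = "normal"    # well-specified function calls in basic scenarios
    SPECIAL = "special"  # vague or incomplete instructions
    AGENT = "agent"      # simulated multi-turn, multi-agent interaction

def pick_evaluator(sample: dict) -> str:
    """Route a sample to the scoring strategy its category implies.
    Evaluator names are placeholders, not ACEBench's real pipeline."""
    scenario = Scenario(sample["category"])
    if scenario is Scenario.NORMAL:
        return "compare_call_to_gold"    # rule-based match against the annotation
    if scenario is Scenario.SPECIAL:
        return "expect_clarification"    # the model should flag the missing details
    return "score_dialogue_trajectory"   # judge tool use across simulated turns

# One illustrative sample per tier (fields are assumptions):
samples = [
    {"category": "normal",
     "instruction": "Book a flight from Shanghai to Beijing on 2025-03-01."},
    {"category": "special",
     "instruction": "Book me a flight."},  # destination and date are missing
    {"category": "agent",
     "instruction": "Plan my trip end to end with the booking assistant."},
]
for s in samples:
    print(s["category"], "->", pick_evaluator(s))
```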
Problem

Research questions and friction points this paper is trying to address.

Large Language Model Evaluation · Multi-Agent Interaction · Comprehensive Assessment Framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

ACEBench · Multi-scenario Evaluation · Large Language Model Tool Usage
👥 Authors
Chen Chen · University of Science and Technology of China
Xinlong Hao · Huawei Noah’s Ark Lab
Weiwen Liu · Associate Professor, Shanghai Jiao Tong University · large language models, AI agents, recommender systems
Xu Huang · University of Science and Technology of China
Xingshan Zeng · Huawei Noah’s Ark Lab · Natural Language Processing, Speech Translation, Large Language Models
Shuai Yu · Huawei Noah’s Ark Lab
Dexun Li · Singapore Management University · Reinforcement Learning, Resource Optimisation, Recommendation System
Shuai Wang · Huawei Noah’s Ark Lab
Weinan Gan · Huawei Noah’s Ark Lab · Large Language Model, Generative IR, Agent
Yuefeng Huang · University of Science and Technology of China
Xinzhi Wang · Huawei Noah’s Ark Lab
Defu Lian · University of Science and Technology of China
Baoqun Yin · University of Science and Technology of China
Yasheng Wang · Tencent · Natural Language Processing
Wu Liu · University of Science and Technology of China