🤖 AI Summary
Existing benchmarks for evaluating LLM tool use suffer from narrow scenario coverage, limited assessment dimensions, and low efficiency, and fail to reflect real-world complexities such as multi-turn dialogue, ambiguous instructions, and multi-agent interaction. To address these limitations, we propose ACEBench, a comprehensive benchmark for evaluating LLMs' tool-calling capabilities, built around a three-tiered evaluation paradigm: Normal, Special, and Agent. We design a lightweight result-verification mechanism combining rule-based and symbolic validation, a multi-turn state-modeling method, and a multi-agent collaborative simulation framework, together with a structured toolchain for annotating and analyzing function-call trajectories. Evaluated across 12 mainstream LLMs, our system achieves 92% error-attribution accuracy and improves evaluation throughput by 47×, effectively uncovering critical weaknesses in LLMs' handling of ambiguous instructions and collaborative execution.
📝 Abstract
Large language models (LLMs) have demonstrated significant potential in decision-making and reasoning, especially when combined with various tools to effectively solve complex problems. However, existing systems for evaluating LLM function-calling capabilities have several limitations: (1) limited evaluation scenarios, lacking assessment in real multi-turn dialogue contexts; (2) narrow evaluation dimensions, lacking detailed assessment of fine-grained function calls; (3) reliance on LLMs or real API executions for result evaluation, which introduces significant overhead. To address these issues, we propose a comprehensive evaluation system named ACEBench. This system is meticulously designed to encompass a wide spectrum of function-calling scenarios and categorizes them into three primary types according to the evaluation methodology: Normal, Special, and Agent. Normal evaluates function calls in basic scenarios; Special evaluates function calls under vague or incomplete instructions; Agent introduces multi-agent interaction to simulate function calling in real-world multi-turn settings. We conducted extensive experiments on ACEBench, analyzing various LLMs in depth and performing a more granular analysis of error causes across different data types.
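To make the lightweight, rule-based result verification described above concrete, here is a minimal sketch of such a checker. All names (`verify_call`, the failure-reason labels, the call format) are illustrative assumptions, not ACEBench's actual API: it compares a model's predicted function call against a ground-truth call by rules alone, with no LLM judge or live API execution, and returns a reason string that supports error attribution.

```python
# Hypothetical sketch of rule-based function-call verification
# (names and labels are illustrative, not ACEBench's actual API).

def verify_call(predicted: dict, expected: dict) -> tuple[bool, str]:
    """Compare a predicted function call against ground truth.

    Each call is a dict like:
        {"name": "get_weather", "arguments": {"city": "Paris"}}
    Returns (passed, reason) so failures can be attributed to a cause.
    """
    # Rule 1: the correct function must be selected.
    if predicted.get("name") != expected["name"]:
        return False, "wrong_function"
    pred_args = predicted.get("arguments", {})
    exp_args = expected.get("arguments", {})
    # Rule 2: every expected parameter must be supplied.
    missing = set(exp_args) - set(pred_args)
    if missing:
        return False, f"missing_params:{sorted(missing)}"
    # Rule 3: no hallucinated parameters are allowed.
    extra = set(pred_args) - set(exp_args)
    if extra:
        return False, f"hallucinated_params:{sorted(extra)}"
    # Rule 4: parameter values must match the reference.
    for key, value in exp_args.items():
        if pred_args[key] != value:
            return False, f"wrong_value:{key}"
    return True, "ok"


ok, reason = verify_call(
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "get_weather", "arguments": {"city": "Paris", "unit": "C"}},
)
# Fails: the predicted call omits the expected "unit" argument.
```

Because each rule returns a distinct reason label rather than a bare pass/fail, a harness built this way can aggregate failures by cause, which is the kind of fine-grained error attribution the summary reports.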