CONFETTI: Conversational Function-Calling Evaluation Through Turn-Level Interactions

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately capture the fine-grained function-calling capabilities of large language models (LLMs) in complex, multi-turn dialogues. To address this, we introduce *CONFETTI*, a multi-turn, fine-grained benchmark designed specifically for conversational function calling. It comprises 109 human-simulated dialogues, 313 user turns, and 86 real-world APIs, emphasizing challenging scenarios such as goal switching, implicit intent inference, and multi-step chained API calls. We propose an evaluation framework grounded in dialogue-act annotation and off-policy per-turn assessment, explicitly modeling ambiguous goals, dynamic error correction, and long-range contextual dependencies. Empirical results reveal substantial performance divergence among leading models under extended dialogues and multi-API conditions, with particularly low success rates for chained invocations. Nova Pro achieves the highest accuracy (40.01%), followed by Claude Sonnet v3.5 and Llama 3.1 405B. This work establishes a rigorous, publicly available benchmark for evaluating LLMs' functional reasoning in interactive settings.

📝 Abstract
We introduce Conversational Function-Calling Evaluation Through Turn-Level Interactions (CONFETTI), a conversational benchmark designed to evaluate the function-calling capabilities and response quality of large language models (LLMs). Current benchmarks lack comprehensive assessment of LLMs in complex conversational scenarios. CONFETTI addresses this gap through 109 human-simulated conversations, comprising 313 user turns and covering 86 APIs. These conversations explicitly target various conversational complexities, such as follow-ups, goal correction and switching, and ambiguous or implicit goals. We perform off-policy, turn-level evaluation of function calling using this benchmark. The benchmark also incorporates dialog act annotations to assess agent responses. We evaluate a series of state-of-the-art LLMs and analyze their performance with respect to the number of available APIs, conversation length, and chained function calling. Our results reveal that while some models handle long conversations and successfully leverage more than 20 APIs, others struggle with longer contexts or with an increasing number of APIs. We also find that performance on chained function calls is severely limited across all models. Overall, the top-performing models on CONFETTI are Nova Pro (40.01%), Claude Sonnet v3.5 (35.46%), and Llama 3.1 405B (33.19%), followed by command-r-plus (31.18%) and Mistral-Large-2407 (30.07%).
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs' function-calling in complex conversations
Assesses response quality with dialog act annotations
Measures performance on chained function-calls and API usage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-simulated conversations for evaluation
Turn-level off-policy function-calling assessment
Dialog act annotations for response quality
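The chained function-calling scenario the benchmark stresses can be illustrated with a toy two-step chain, where the output of one API call is required as the input of the next. The API names below are hypothetical examples, not APIs from CONFETTI.

```python
# Illustrative two-step chained function call: the model must plan
# both steps, feeding the first call's output into the second.
# These toy APIs are hypothetical, not part of the benchmark.
def get_user_city(user_id):
    """First API: resolve a user ID to a city."""
    return {"u1": "Boston"}.get(user_id)

def get_weather(city):
    """Second API: look up the weather for a city."""
    return {"Boston": "rainy"}.get(city)

def answer(user_id):
    # Correct chained invocation: city must be resolved before
    # it can be passed to the weather lookup.
    city = get_user_city(user_id)
    return get_weather(city)
```

Answering "what's the weather where user u1 lives?" requires both calls in order; the paper reports that this chaining step is where model success rates drop most sharply.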