Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky

📅 2025-07-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) frequently misinvoke enterprise APIs when near-duplicate tools compete for the same user intent or when required parameters are left underspecified. To address this, the paper proposes DiaFORGE, a disambiguation-centric, three-stage training framework comprising persona-driven multi-turn dialogue synthesis, reasoning-trace-augmented supervised fine-tuning, and dynamic agent-loop evaluation. DiaFORGE is applied to open-source LLMs ranging from 3B to 70B parameters and evaluated on DiaBENCH, a dynamic benchmark for enterprise API disambiguation that redeploys each model in a live agentic loop. Models trained with DiaFORGE improve tool-invocation success by 27 percentage points over GPT-4o and by 49 percentage points over Claude-3.5-Sonnet, both under optimized prompting. The authors also release an open corpus of 5,000 production-grade enterprise API specifications paired with rigorously validated, disambiguation-focused dialogues. These gains improve LLM robustness and deployability in real-world enterprise settings.

📝 Abstract
Large language models (LLMs) are increasingly tasked with invoking enterprise APIs, yet they routinely falter when near-duplicate tools vie for the same user intent or when required arguments are left underspecified. We introduce DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation), a disambiguation-centric, three-stage pipeline that (i) synthesizes persona-driven, multi-turn dialogues in which the assistant must distinguish among highly similar tools, (ii) performs supervised fine-tuning of open-source models with reasoning traces across 3B-70B parameters, and (iii) evaluates real-world readiness via a dynamic suite that redeploys each model in a live agentic loop and reports end-to-end goal completion alongside conventional static metrics. On our dynamic benchmark DiaBENCH, models trained with DiaFORGE raise tool-invocation success by 27 pp over GPT-4o and by 49 pp over Claude-3.5-Sonnet, both under optimized prompting. To spur further research, we release an open corpus of 5,000 production-grade enterprise API specifications paired with rigorously validated, disambiguation-focused dialogues, offering a practical blueprint for building reliable, enterprise-ready tool-calling agents.
Problem

Research questions and friction points this paper is trying to address.

LLMs struggle with near-duplicate tools and underspecified arguments
Need for realistic disambiguation in enterprise API tool-calling
Lack of dynamic evaluation for real-world tool-invocation success
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disambiguation-centric three-stage pipeline
Persona-driven multi-turn dialogue synthesis
Supervised fine-tuning with reasoning traces
Dynamic agent-loop evaluation of end-to-end goal completion
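The disambiguation behavior described above, distinguishing near-duplicate tools and detecting underspecified arguments, can be sketched as a minimal decision rule. This is an illustrative assumption, not the paper's actual method: the tool specs, keyword matching, and function names below are all hypothetical.

```python
# Hypothetical sketch: an agent should invoke a tool only when exactly one
# candidate matches the user's intent AND every required parameter is known;
# otherwise it should ask a clarifying question.
from dataclasses import dataclass

@dataclass
class ToolSpec:
    name: str
    keywords: set   # terms that signal this tool's intent (toy matcher)
    required: set   # parameter names the call cannot omit

def next_action(utterance: str, known_args: dict, tools: list) -> str:
    """Return 'call:<name>' when safe to invoke, else 'clarify'."""
    words = set(utterance.lower().split())
    matches = [t for t in tools if t.keywords & words]
    if len(matches) != 1:
        return "clarify"  # near-duplicate tools compete for the intent
    tool = matches[0]
    if tool.required - known_args.keys():
        return "clarify"  # a required argument is underspecified
    return f"call:{tool.name}"

tools = [
    ToolSpec("create_invoice", {"invoice", "bill"}, {"customer_id", "amount"}),
    ToolSpec("create_credit_note", {"credit", "refund", "bill"}, {"customer_id", "amount"}),
]

# "bill" matches both tools, so the agent must disambiguate first.
print(next_action("please bill the customer", {"customer_id": "C1", "amount": 40}, tools))
# → clarify
# Unambiguous keyword and complete arguments: safe to invoke.
print(next_action("create an invoice", {"customer_id": "C1", "amount": 40}, tools))
# → call:create_invoice
```

In DiaFORGE's synthesized dialogues this decision is made by the fine-tuned model itself, with a reasoning trace, rather than by hand-written rules as in this toy example.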