🤖 AI Summary
Acquiring high-quality agentic data—encompassing user intent, tool invocation, parameterized execution, and verifiable trajectories—is costly and difficult to scale in large language model (LLM)-based agent training.
Method: This paper introduces the first fully LLM-driven, modular framework for synthetic agentic data generation. It combines constrained prompt engineering, structured pseudocode generation, tool-specification embedding, and execution-trajectory modeling, augmented by JSON Schema–enforced syntactic constraints and a discriminative filtering mechanism, to enable end-to-end, multi-turn, multi-task generation of goal decompositions, tool selections, and execution traces.
Results: Experiments show >98% syntactic correctness, significantly improved semantic consistency over baselines, and >90% reduction in human annotation effort. The framework enables scalable, reproducible, and verifiable construction of agentic training data.
📝 Abstract
Large language models (LLMs) are increasingly deployed as agents, expected to decompose goals, invoke tools, and verify results in dynamic environments. Realizing these capabilities requires access to agentic data: structured interaction records that couple user intents with tool specifications, argument-grounded calls, and verifiable execution traces. However, collecting such data from human annotators is costly, time-consuming, and difficult to scale.
We present a unified framework for synthesizing agentic data using only LLMs, without any human-in-the-loop supervision. This framework decomposes generation into modular pipelines that produce complete interaction records spanning task specifications, tool definitions, policy pseudocode, natural language exchanges, and execution traces. Records conform to strict syntactic and semantic constraints, ensuring machine-parseability and faithful alignment across inputs, outputs, and tool calls.
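To make the shape of such a record concrete, here is a minimal sketch in Python. The field names and contents are illustrative assumptions, not the paper's exact schema; the point is that one record couples a task specification, tool definitions, policy pseudocode, the natural language exchange, and an execution trace in a single machine-parseable object.

```python
import json

# Hypothetical agentic interaction record; field names are assumptions
# chosen to mirror the components the framework is described as producing.
record = {
    "task_specification": "Book the cheapest direct flight from SFO to JFK on 2024-07-01.",
    "tool_definitions": [
        {
            "name": "search_flights",
            "parameters": {"origin": "string", "destination": "string", "date": "string"},
        }
    ],
    "policy_pseudocode": "1. search flights; 2. keep direct only; 3. pick cheapest; 4. report.",
    "dialogue": [
        {"role": "user", "content": "Find the cheapest direct SFO to JFK flight on July 1."},
        {"role": "assistant", "content": "Searching available flights now."},
    ],
    "execution_trace": [
        {
            "tool_call": "search_flights",
            "arguments": {"origin": "SFO", "destination": "JFK", "date": "2024-07-01"},
            "result": {"flights": [{"id": "UA123", "price": 199, "direct": True}]},
        }
    ],
}

# Machine-parseability: the record must survive a JSON round trip intact.
serialized = json.dumps(record)
assert json.loads(serialized) == record
```

The round-trip check at the end reflects the abstract's machine-parseability requirement: every record is plain JSON, so downstream training pipelines can consume it without bespoke parsing.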
Beyond single tasks, the framework supports both multi-task and multi-turn agent interactions, enabling the construction of datasets that reflect the full spectrum of tool-use competencies. To ensure quality and consistency, it integrates constrained generation formats, JSON-schema validation, and judge-based filtering.
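The validate-then-filter step can be sketched as follows, using only the standard library. In practice a full JSON Schema validator (e.g. the `jsonschema` package) and an LLM judge would replace the toy checks here; the function and field names are illustrative assumptions.

```python
# Toy stand-ins for the two quality gates: a schema-style syntactic check
# and a judge-style semantic filter over generated records.
REQUIRED_FIELDS = {
    "task_specification": str,
    "tool_definitions": list,
    "dialogue": list,
    "execution_trace": list,
}

def is_syntactically_valid(record: dict) -> bool:
    """Schema-style check: required fields are present with the right types."""
    return all(
        isinstance(record.get(field), expected)
        for field, expected in REQUIRED_FIELDS.items()
    )

def judge_filter(record: dict) -> bool:
    """Stand-in for an LLM judge: every trace step must call a defined tool."""
    defined = {tool["name"] for tool in record["tool_definitions"]}
    return all(step["tool_call"] in defined for step in record["execution_trace"])

good = {
    "task_specification": "look up the weather",
    "tool_definitions": [{"name": "get_weather"}],
    "dialogue": [{"role": "user", "content": "Weather in Paris?"}],
    "execution_trace": [{"tool_call": "get_weather"}],
}
# Same record, but its trace calls a tool that was never defined.
bad = dict(good, execution_trace=[{"tool_call": "undefined_tool"}])

kept = [r for r in (good, bad) if is_syntactically_valid(r) and judge_filter(r)]
assert kept == [good]
```

Chaining a cheap syntactic gate before the (expensive) judge is the natural design: malformed records are rejected without spending an LLM call on them.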
This paper formalizes the schema for agentic records, details the prompt design principles that guide generation, and introduces scalable pipelines for high-quality synthetic data. By providing a reproducible, LLM-only alternative to manual collection, this work advances the development of agentic LLMs capable of robust tool use.