Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current AI agent supervised fine-tuning (SFT) suffers from heterogeneous, fragmented training data—diverse in origin, format, and schema—leading to high integration costs and hindering standardized, scalable training. To address this, we propose the Agent Data Protocol (ADP), a lightweight, general-purpose structured representation language. ADP unifies 13 heterogeneous agent datasets—including API invocation, code generation, and web interaction—into a single, semantically consistent intermediate format, eliminating per-dataset engineering. A modular parsing and conversion toolchain enables seamless export of training-ready data for mainstream agent frameworks. Extensive large-scale SFT evaluation demonstrates that ADP improves baseline model performance by ~20% on average across standard benchmarks, achieving state-of-the-art (SOTA) or near-SOTA results without domain-specific tuning. All code and data are publicly released.

📝 Abstract
Public research results on large-scale supervised fine-tuning of AI agents remain relatively rare, since the collection of agent training data presents unique challenges. In this work, we argue that the bottleneck is not a lack of underlying data sources, but that a large variety of data is fragmented across heterogeneous formats, tools, and interfaces. To this end, we introduce the agent data protocol (ADP), a lightweight representation language that serves as an "interlingua" between agent datasets in diverse formats and unified agent training pipelines downstream. The design of ADP is expressive enough to capture a large variety of tasks, including API/tool use, browsing, coding, software engineering, and general agentic workflows, while remaining simple to parse and train on without engineering at a per-dataset level. In experiments, we unified a broad collection of 13 existing agent training datasets into ADP format and converted the standardized ADP data into training-ready formats for multiple agent frameworks. We performed SFT on these data, demonstrating an average performance gain of ~20% over corresponding base models and delivering state-of-the-art or near-SOTA performance on standard coding, browsing, tool use, and research benchmarks, without domain-specific tuning. All code and data are released publicly, in the hope that ADP can help lower the barrier to standardized, scalable, and reproducible agent training.
Problem

Research questions and friction points this paper is trying to address.

Unifying fragmented agent datasets across diverse formats and interfaces
Creating a standardized protocol for agent training data interoperability
Enabling effective fine-tuning of LLM agents without per-dataset engineering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the lightweight Agent Data Protocol for dataset unification
Converts diverse agent datasets into a standardized training format
Enables unified fine-tuning across multiple agent frameworks
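The export step named above can be sketched as follows. This is a hedged illustration, not the paper's actual exporter: the function `to_chat_messages` and the trajectory/message keys are assumptions, with the output shaped like the role/content chat format commonly used for SFT:

```python
import json

# Hypothetical sketch: flatten a unified ADP-style trajectory into
# training-ready chat messages. Keys follow common SFT conventions,
# not necessarily the protocol's real exporter.

def to_chat_messages(trajectory: dict) -> list[dict]:
    """Flatten a unified trajectory into role/content messages for SFT."""
    messages = [{"role": "system", "content": trajectory["task"]}]
    for step in trajectory["steps"]:
        content = step["content"]
        if step.get("tool_call"):
            # Serialize the structured action inline so any chat-format
            # trainer can consume it without custom handling.
            content += "\n" + json.dumps(step["tool_call"])
        messages.append({"role": step["role"], "content": content})
    return messages

example = {
    "task": "Find the repo's license.",
    "steps": [
        {"role": "assistant", "content": "I'll read LICENSE.",
         "tool_call": {"name": "read_file", "args": {"path": "LICENSE"}}},
        {"role": "tool", "content": "MIT License"},
    ],
}
msgs = to_chat_messages(example)
print([m["role"] for m in msgs])  # → ['system', 'assistant', 'tool']
```

Because the exporter only ever sees the unified format, adding a new agent framework requires one such function rather than one converter per source dataset.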
👥 Authors
Yueqi Song
BS/MS student, Carnegie Mellon University
AI Agents, Multimodal NLP, Multilingual NLP
Ketan Ramaneti
Carnegie Mellon University
Zaid Sheikh
Carnegie Mellon University
Ziru Chen
The Ohio State University
Conversational AI, Natural Language Processing, Machine Learning
Boyu Gou
The Ohio State University
Artificial Intelligence, Language Agents, GUI Agents
Tianbao Xie
University of Hong Kong
Artificial Intelligence, Deep Learning, Natural Language Processing
Yiheng Xu
University of Hong Kong
Natural Language Processing
Danyang Zhang
University of Hong Kong
Apurva Gandhi
Carnegie Mellon University
Machine Learning, Artificial Intelligence
Fan Yang
Fujitsu Research
Joseph Liu
Carnegie Mellon University
Tianyue Ou
Carnegie Mellon University
Zhihao Yuan
Ph.D. student at The Chinese University of Hong Kong, Shenzhen
Vision and Language, 3D Scene Understanding
Frank Xu
Carnegie Mellon University
Shuyan Zhou
Duke University
Large Language Models, AI Agents
Xingyao Wang
All Hands AI, University of Illinois Urbana-Champaign
Xiang Yue
Carnegie Mellon University
Natural Language Processing, Large Language Models, Machine Learning
Tao Yu
University of Hong Kong
Huan Sun
Endowed CoE Innovation Scholar and Associate Professor, The Ohio State University
Agents, Large Language Models, Natural Language Processing, AI
Yu Su
The Ohio State University
Graham Neubig
Carnegie Mellon University, All Hands AI
Natural Language Processing, Machine Learning, Artificial Intelligence