AI Summary
This work addresses the limited performance of large language models (LLMs) in specialized domains, primarily due to the scarcity of high-quality domain-specific data and the reliance of existing data construction methods on manual processes. To overcome these challenges, the paper introduces a novel paradigm termed "autonomous agent-driven data engineering," which for the first time treats LLMs as autonomous data engineers capable of end-to-end, human-intervention-free data generation and refinement. The proposed framework integrates planning, curriculum learning, and iterative feedback mechanisms to dynamically construct and optimize training data. Experimental results demonstrate that curricula generated by GPT-5.2 within this framework improve student model performance in the target domain by 57.29%, offering strong empirical validation of the efficacy and innovation of agent-driven data engineering.
Abstract
Large Language Models (LLMs) have demonstrated strong performance on general tasks, but they often struggle to adapt to specialized domains without high-quality domain-specific data. Existing LLM-based data curation methods primarily rely on human-designed workflows, leaving it unexamined whether LLMs can autonomously execute an end-to-end data engineering pipeline for model specialization. We formalize \textbf{Autonomous Agentic Data Engineering}, a novel task designed to evaluate LLMs as autonomous data engineers that drive model specialization through end-to-end data curation. We frame data as an optimizable component and study agents that plan, generate, and iteratively optimize training data across multiple domains, guided by post-training performance improvement. Experiments show that autonomous LLM data engineers yield substantial gains: GPT-5.2 constructs a training curriculum that improves a student model by \textbf{57.29\%}, entirely through iterative, agent-driven data adaptation. By illuminating both potential and bottlenecks, our study establishes autonomous data engineering as a measurable capability and charts a path toward agent-driven model specialization\footnote{Code will be released at https://github.com/zjunlp/DataAgent.}.