DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
The era of large language models (LLMs) faces critical challenges including insufficient high-quality data supply, fragmented data preparation pipelines, poor reproducibility, and lack of model-in-the-loop support. Method: We propose the first LLM-driven, unified data preparation framework for data-centric AI, featuring system-level abstractions and PyTorch-style APIs for modular design. We introduce DataFlow-Agent—the first agent that synthesizes executable data pipelines end-to-end from natural language specifications—and integrate LLM-powered operator synthesis, iterative validation, 200+ reusable operators, and six domain-agnostic pipeline templates. Results: Experiments on Text-to-SQL, code generation, and mathematical reasoning show our synthesized data significantly outperforms human-annotated and domain-specific synthetic data. Remarkably, models trained on just 10K of our samples surpass counterparts trained on the million-scale Infinity-Instruct dataset, empirically validating the decisive impact of data quality on model performance.
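The summary mentions "PyTorch-style APIs for modular design" without showing what that looks like. As a rough illustration only, here is a minimal sketch of a composable operator/pipeline interface in that style; all class and method names below are hypothetical stand-ins, not DataFlow's actual API:

```python
# Hypothetical sketch of a PyTorch-style data-preparation pipeline:
# operators are modular, reusable objects chained into a sequential dataflow,
# analogous to torch.nn.Module layers composed in an nn.Sequential.

class Operator:
    """Base class: each operator transforms a list of records."""
    def __call__(self, records):
        raise NotImplementedError

class DeduplicateOp(Operator):
    """Drop records whose text was already seen."""
    def __call__(self, records):
        seen, out = set(), []
        for r in records:
            if r["text"] not in seen:
                seen.add(r["text"])
                out.append(r)
        return out

class MinLengthFilterOp(Operator):
    """Drop records shorter than a character threshold."""
    def __init__(self, min_chars):
        self.min_chars = min_chars
    def __call__(self, records):
        return [r for r in records if len(r["text"]) >= self.min_chars]

class Pipeline:
    """Chains operators, like nn.Sequential chains layers."""
    def __init__(self, *ops):
        self.ops = ops
    def run(self, records):
        for op in self.ops:
            records = op(records)  # each stage's output is inspectable
        return records

pipe = Pipeline(DeduplicateOp(), MinLengthFilterOp(min_chars=10))
cleaned = pipe.run([
    {"text": "short"},
    {"text": "a longer duplicate sample"},
    {"text": "a longer duplicate sample"},
])
print(len(cleaned))  # 1: the duplicate and the too-short sample are removed
```

The appeal of this style, as the paper argues for its real API, is that each stage is a debuggable unit: you can run, test, and swap operators independently, then compose them into domain-specific pipelines.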

📝 Abstract
The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3% execution accuracy in Text-to-SQL over SynSQL, +7% average improvements on code benchmarks, and 1–3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.
Problem

Research questions and friction points this paper is trying to address.

Addresses scalable, reliable data preparation for LLMs
Replaces ad-hoc scripts with modular, reusable data transformations
Automates pipeline creation from natural language specifications
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-driven framework for unified data preparation
Modular and composable data transformations with PyTorch-style API
Agent automatically converts natural language to executable pipelines
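The "natural language to executable pipelines" claim above rests on a plan-execute-verify loop (operator synthesis, pipeline planning, iterative verification, per the abstract). As a toy illustration only, the loop could be sketched as follows; the `plan` and `validate` functions here are simple stand-ins for DataFlow-Agent's LLM-driven planner and verifier, whose real interfaces are not shown in this page:

```python
# Toy sketch of an agent loop that turns a natural-language spec into a
# pipeline (a list of operator names) and repairs it until validation passes.
# Both plan() and validate() are hypothetical stand-ins for LLM components.

MAX_RETRIES = 3

def plan(spec, feedback=None):
    """Stand-in planner: map a spec (plus repair feedback) to operators."""
    ops = ["dedup", "filter_short"]
    if feedback == "missing quality check" or "high-quality" in spec:
        ops.append("quality_score")
    return ops

def validate(ops):
    """Stand-in verifier: require a quality-scoring operator in the plan."""
    if "quality_score" in ops:
        return True, None
    return False, "missing quality check"

def synthesize_pipeline(spec):
    """Iteratively plan and verify, feeding failures back into the planner."""
    feedback = None
    for _ in range(MAX_RETRIES):
        ops = plan(spec, feedback)
        ok, feedback = validate(ops)
        if ok:
            return ops
    raise RuntimeError("could not synthesize a valid pipeline")

print(synthesize_pipeline("clean raw web text"))
# ['dedup', 'filter_short', 'quality_score'] after one repair iteration
```

The key design point this sketch captures is that verification feedback flows back into the next planning round, so the agent converges on an executable pipeline rather than emitting a single unchecked guess.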
Authors

Hao Liang, Xiaochen Ma, Zhou Liu (China Southern Power Grid / Shenzhen Power Supply Co., Ltd.), Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han (Nanjing Forestry University), Meiyi Qiang, Yalin Feng, Tianyi Bai (Hong Kong University of Science and Technology), Zewei Pan, Ziyi Guo, Yizhen Jiang, Jingwen Deng, Qijie You, Peichao Lai, Tianyu Guo, Chi Hsu Tsai, Hengyi Feng, Rui Hu, Wenkai Yu, Junbo Niu (Peking University)

Shared affiliations (unless noted above): 1 Peking University; 2 Institute for Advanced Algorithms Research, Shanghai; 3 OriginHub Technology; 4 OpenDataLab, Shanghai Artificial Intelligence Laboratory; 5 LLaMA-Factory Team