🤖 AI Summary
This work addresses the challenges in large language model training caused by the high cost of human-annotated high-quality data and the limited adaptability and lack of exemplar guidance in existing automated data preparation methods. The authors propose a multi-level self-evolving data preparation framework that, for the first time, endows data processing systems with self-evolution capabilities. By integrating operator-level extensibility and pipeline-level feedback loops, the framework dynamically narrows the distributional gap between generated data and high-quality exemplars. Its core components include logical plan construction, dependency conflict resolution, code instantiation, and an iterative optimization mechanism. Experiments across seven benchmarks demonstrate that the generated data substantially enhances training effectiveness, improving the average downstream performance of large language models by 10% over models trained on original data.
📝 Abstract
High-quality training data is essential to large language models (LLMs) and typically requires extensive and costly manual curation. Existing automatic data preparation methods rely on predefined pipelines or customized human instructions, which limits their adaptability to diverse data distributions and leaves them without principled guidance from high-quality examples. In this paper, we introduce DataEvolver, the first self-evolving data preparation system that automatically constructs pipelines to transform raw data into high-quality data. DataEvolver employs a multi-level mechanism to ensure both pipeline executability and effectiveness. At the operator level, it incrementally expands the operator set to construct a logical plan while resolving dependency conflicts. At the pipeline level, it instantiates logical plans into executable code and iteratively refines pipeline orchestration through a feedback loop that reduces the distribution gap between prepared data and high-quality examples. Experiments on seven benchmarks show that DataEvolver substantially improves data quality and achieves an average 10% gain in downstream LLM performance compared with training on original data, highlighting new opportunities for the iterative co-evolution of LLMs and data.
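To make the two levels described in the abstract more concrete, the following is a minimal toy sketch and not DataEvolver's actual implementation. Every name in it (`Operator`, `order_plan`, `gap`, `evolve`) is hypothetical. It assumes that dependency handling can be reduced to ordering operators topologically, and that the distribution gap can be stood in for by the difference in mean text length between prepared data and exemplars; the paper's real operators, conflict resolution, and gap measure are not specified here.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # Python 3.9+
from statistics import mean
from typing import Callable

Records = list[str]


@dataclass(frozen=True)
class Operator:
    """Hypothetical data-preparation operator with named prerequisites."""
    name: str
    fn: Callable[[Records], Records]
    depends_on: tuple[str, ...] = ()


def order_plan(ops: list[Operator]) -> list[Operator]:
    """Operator level (toy): order the plan so prerequisites run first.

    Prerequisites that are not part of the plan are ignored in this sketch.
    """
    by_name = {op.name: op for op in ops}
    graph = {op.name: {d for d in op.depends_on if d in by_name} for op in ops}
    return [by_name[n] for n in TopologicalSorter(graph).static_order()]


def run(plan: list[Operator], data: Records) -> Records:
    """Execute the ordered plan on the data."""
    for op in order_plan(plan):
        data = op.fn(data)
    return data


def gap(data: Records, exemplars: Records) -> float:
    """Toy stand-in for a distribution gap: difference in mean text length."""
    if not data:
        return float("inf")
    return abs(mean([len(s) for s in data]) - mean([len(s) for s in exemplars]))


def evolve(raw: Records, exemplars: Records, candidates: list[Operator],
           max_rounds: int = 5) -> tuple[list[Operator], float]:
    """Pipeline level (toy): keep a candidate only if it shrinks the gap."""
    plan: list[Operator] = []
    best = gap(raw, exemplars)
    for _ in range(max_rounds):
        improved = False
        for op in candidates:
            if op in plan:
                continue
            score = gap(run(plan + [op], raw), exemplars)
            if score < best:
                plan, best, improved = plan + [op], score, True
        if not improved:
            break  # no remaining candidate reduces the gap
    return order_plan(plan), best


# Illustrative candidates and data (all invented for this sketch).
candidates = [
    Operator("strip", lambda d: [s.strip() for s in d]),
    Operator("drop_empty", lambda d: [s for s in d if s], ("strip",)),
    Operator("dedupe", lambda d: list(dict.fromkeys(d)), ("strip",)),
]
raw = ["  hello world  ", "hello world", "", "   ", "good sample text"]
exemplars = ["clean sentence", "another clean one"]

plan, best = evolve(raw, exemplars, candidates)
print([op.name for op in plan], best)
```

In this toy run, `strip` is rejected in the first round because on its own it widens the gap, then accepted in the second round once `drop_empty` is in the plan. That is the kind of behavior an iterative feedback loop can capture and a single fixed pass cannot.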