🤖 AI Summary
Data standardization is critical in the data science lifecycle, yet existing tools (e.g., Pandas) require manual, error-prone coding, and LLM-based automation still demands expert prompt engineering and iterative interaction. To address this, we propose CleanAgent, a declarative, API-driven LLM-agent framework built on *Dataprep.Clean*, a column-type-aware standardization component that standardizes a column with a single line of code. Together they automate standardization end to end from a one-shot natural language request. The method combines domain knowledge modeling with lightweight agent orchestration, removing the need for programming expertise and enabling semantic cleaning of heterogeneous columns. Evaluated on real-world datasets, the approach achieves high accuracy and robustness across diverse standardization tasks. Deployed as an interactive web tool, it substantially lowers the barrier to entry for data practitioners. This work moves data preprocessing toward declarative, intelligent automation, bridging the gap between domain expertise and scalable, user-friendly tooling.
📝 Abstract
Data standardization is a crucial part of the data science life cycle. While tools like Pandas offer robust functionality, their complexity and the manual effort required to customize code for diverse column types pose significant challenges. Although large language models (LLMs) like ChatGPT have shown promise in automating this process through natural language understanding and code generation, using them still demands expert-level programming knowledge and continuous interaction for prompt refinement. To address these challenges, our key idea is a Python library with declarative, unified APIs for standardizing different column types, which simplifies the LLM's code generation to concise API calls. We first propose Dataprep.Clean, a component of the Dataprep Python library that significantly reduces coding complexity by enabling the standardization of a specific column type with a single line of code. We then introduce the CleanAgent framework, which integrates Dataprep.Clean with LLM-based agents to automate the data standardization process. With CleanAgent, data scientists need to provide their requirements only once, allowing for a hands-free process. To demonstrate the practical utility of CleanAgent, we developed a user-friendly web application that allows attendees to interact with it using real-world datasets.
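
To make the idea of a declarative, column-type-aware API concrete, here is a minimal standard-library sketch. It is not Dataprep.Clean's implementation or its actual API: the `clean` function, the `CLEANERS` registry, the `<column>_clean` naming, and the sample rows are all hypothetical. It only illustrates how one call per column can replace hand-written, type-specific cleaning code.

```python
import re
from datetime import datetime


def _clean_email(value):
    """Lowercase and trim an email; return None if it does not look valid."""
    value = value.strip().lower()
    return value if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value) else None


def _clean_date(value):
    """Try a few known input formats and emit ISO 8601 (YYYY-MM-DD)."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y"):
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


# Hypothetical registry: one standardizer per column type.
CLEANERS = {"email": _clean_email, "date": _clean_date}


def clean(rows, column, column_type):
    """Return new rows with a '<column>_clean' field added (assumed naming)."""
    cleaner = CLEANERS[column_type]
    return [{**row, f"{column}_clean": cleaner(row[column])} for row in rows]


rows = [
    {"email": " Alice@Example.COM ", "joined": "03/15/2021"},
    {"email": "not-an-email", "joined": "15.03.2021"},
]

# One declarative line per column: say what the column is, not how to clean it.
rows = clean(rows, "email", "email")
rows = clean(rows, "joined", "date")
```

Under this design, an LLM agent would only need to pick a column and its type rather than generate cleaning logic, which is the simplification the abstract attributes to concise API calls.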