🤖 AI Summary
Traditional rule-based approaches for tabular data quality assessment suffer from low efficiency, heavy reliance on manual effort, and high computational costs. To address these limitations, this paper proposes a three-stage automated framework: (1) statistical inlier detection to identify anomalous patterns; (2) iterative generation of semantically accurate and executable quality rules and corresponding validation code, leveraging large language models (LLMs) augmented with retrieval-augmented generation (RAG) and domain-specific examples; and (3) constraint-guided generation coupled with few-shot fine-tuning to enhance reliability. The framework integrates clustering analysis, LLMs, code-generation models, and RAG for end-to-end rule discovery and validation. Experiments across multiple benchmark datasets demonstrate substantial improvements in rule generation accuracy and automation: the rule validity rate reaches 92.3%, while average generation time is reduced by 67%.
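The three-stage flow summarized above can be sketched as follows. Everything here is a simplified stand-in, not the paper's implementation: a z-score filter replaces the clustering-based statistical stage, a hard-coded range rule stands in for the LLM's RAG-assisted rule generation, and the guardrail is reduced to basic type checks on the generated validator.

```python
from statistics import mean, stdev

def detect_anomalies(values, k=2.0):
    """Stage 1 (sketch): flag values outside mean ± k·std.
    The paper's clustering-based filtering is stubbed with a z-score rule."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > k * s]

def generate_rule(column, anomalies):
    """Stage 2 (stub): in the real framework an LLM, aided by RAG and
    domain examples, proposes a rule plus validation code; here we
    hard-code a plausible range rule for an age column."""
    lo, hi = 0, 120
    return f"{column} must lie in [{lo}, {hi}]", lambda v: lo <= v <= hi

def guarded_validate(validator, values):
    """Stage 3 (sketch): run the generated validator under a guardrail
    that checks it is callable and returns booleans for every row."""
    assert callable(validator)
    results = [validator(v) for v in values]
    assert all(isinstance(r, bool) for r in results)
    return [v for v, ok in zip(values, results) if not ok]

ages = [25, 31, 28, 999, 27, 30, -4, 29]
rule, check = generate_rule("age", detect_anomalies(ages))
violations = guarded_validate(check, ages)
print(rule)        # age must lie in [0, 120]
print(violations)  # [999, -4]
```

In the full framework, the rule text and validator code would be produced and refined iteratively by separate LLM calls rather than hard-coded.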
📝 Abstract
Reliable data quality is crucial for downstream analysis of tabular datasets, yet rule-based validation often struggles with inefficiency, heavy manual intervention, and high computational costs. We present a three-stage framework that combines statistical inlier detection with LLM-driven rule and code generation. After filtering data samples through traditional clustering, we iteratively prompt LLMs to produce semantically valid quality rules and synthesize their executable validators with code-generating LLMs. To generate reliable quality rules, we augment the LLMs with retrieval-augmented generation (RAG) over external knowledge sources and domain-specific few-shot examples. Robust guardrails ensure the accuracy and consistency of both the rules and the generated code. Extensive evaluations on benchmark datasets confirm the effectiveness of our approach.
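The RAG step described in the abstract can be illustrated with a minimal retrieval-and-prompting sketch. The knowledge base, its entries, and the similarity measure below are all hypothetical: a real system would retrieve from external knowledge sources with embedding-based similarity rather than `difflib` string matching.

```python
from difflib import SequenceMatcher

# Hypothetical knowledge base of (column description, quality rule) pairs.
KNOWLEDGE_BASE = [
    ("customer age in years", "value must be an integer in [0, 120]"),
    ("email address", "value must match a standard email pattern"),
    ("order total in USD", "value must be a non-negative decimal"),
]

def retrieve_examples(query, k=2):
    """RAG sketch: rank stored examples by string similarity to the query."""
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda entry: SequenceMatcher(None, query, entry[0]).ratio(),
        reverse=True,
    )
    return scored[:k]

def build_prompt(column_desc, examples):
    """Assemble a few-shot prompt for the rule-generating LLM."""
    shots = "\n".join(f"Column: {d}\nRule: {r}" for d, r in examples)
    return f"{shots}\nColumn: {column_desc}\nRule:"

prompt = build_prompt("passenger age", retrieve_examples("passenger age"))
```

The resulting prompt ends with an open `Rule:` slot, which the rule-generating LLM completes; the guardrails would then vet both the completed rule and the validator code synthesized from it.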