Quality Assessment of Tabular Data using Large Language Models and Code Generation

📅 2025-09-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional rule-based approaches for tabular data quality assessment suffer from low efficiency, heavy reliance on manual effort, and high computational costs. To address these limitations, this paper proposes a three-stage automated framework: (1) statistical inlier detection to identify anomalous patterns; (2) iterative generation of semantically accurate, executable quality rules and corresponding validation code, leveraging large language models (LLMs) augmented with retrieval-augmented generation (RAG) and domain-specific examples; and (3) constraint-guided generation coupled with few-shot fine-tuning to enhance reliability. The framework integrates clustering analysis, LLMs, code-generation models, and RAG for end-to-end rule discovery and validation. Experiments across multiple benchmark datasets demonstrate substantial improvements in rule-generation accuracy and automation: the rule validity rate reaches 92.3%, while average generation time drops by 67%.
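The first stage screens the table statistically before any LLM is involved. The paper uses clustering analysis for this; as a minimal stand-in, the sketch below flags rows whose values deviate strongly from per-column statistics (a z-score filter, not the paper's actual method; `flag_outliers` and the threshold are hypothetical names and defaults).

```python
import statistics

def flag_outliers(rows, z_thresh=3.0):
    """Hypothetical sketch of stage 1: separate statistical inliers
    from candidate anomalies via per-column z-scores. The paper uses
    clustering; a z-score screen stands in here for illustration."""
    cols = list(zip(*rows))  # column-wise view of the table
    means = [statistics.fmean(c) for c in cols]
    stdevs = [statistics.stdev(c) for c in cols]
    inliers, anomalies = [], []
    for row in rows:
        is_outlier = any(
            sd > 0 and abs(v - m) / sd > z_thresh
            for v, m, sd in zip(row, means, stdevs)
        )
        (anomalies if is_outlier else inliers).append(row)
    return inliers, anomalies
```

The inlier set characterizes "normal" data and feeds the rule-generation prompts, while the flagged rows become candidates for the generated validators to judge.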

📝 Abstract
Reliable data quality is crucial for downstream analysis of tabular datasets, yet rule-based validation often struggles with inefficiency, heavy human intervention, and high computational costs. We present a three-stage framework that combines statistical inlier detection with LLM-driven rule and code generation. After filtering data samples through traditional clustering, we iteratively prompt LLMs to produce semantically valid quality rules and synthesize their executable validators through code-generating LLMs. To generate reliable quality rules, we aid LLMs with retrieval-augmented generation (RAG) by leveraging external knowledge sources and domain-specific few-shot examples. Robust guardrails ensure the accuracy and consistency of both rules and code snippets. Extensive evaluations on benchmark datasets confirm the effectiveness of our approach.
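The abstract's "robust guardrails" around generated code could take many forms; the paper's exact checks are not described in this card. A minimal sketch of one such guardrail, assuming the generated validator is a function named `validate_row` (a hypothetical contract), compiles the LLM output in a stripped namespace and rejects it unless it yields one boolean per row:

```python
def run_validator(code: str, rows, func_name="validate_row"):
    """Hypothetical guardrail harness for LLM-generated validator code:
    execute it in a namespace without builtins (a crude sandbox sketch,
    not a real security boundary) and type-check its outputs before
    trusting the rule."""
    ns = {"__builtins__": {}}
    exec(code, ns)
    fn = ns.get(func_name)
    if not callable(fn):
        raise ValueError(f"generated code must define {func_name}()")
    results = [fn(r) for r in rows]
    if not all(isinstance(r, bool) for r in results):
        raise TypeError("validator must return one boolean per row")
    return results
```

A validator that fails these checks would be sent back for another generation iteration rather than applied to the data.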
Problem

Research questions and friction points this paper is trying to address.

Automating tabular data quality assessment
Reducing rule-based validation inefficiency and costs
Generating executable quality rules via LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining statistical detection with LLM-driven rule generation
Using retrieval-augmented generation for quality rule creation
Synthesizing executable validators through code-generating LLMs
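The RAG step above retrieves domain-specific few-shot examples to ground rule generation. How the paper ranks and formats them is not specified here; the sketch below assumes a small store of (column profile, rule) pairs and a simple token-overlap ranking (`build_rule_prompt` and its fields are hypothetical).

```python
def build_rule_prompt(column_profile, examples, k=2):
    """Hypothetical RAG sketch: rank stored (profile, rule) few-shot
    examples by token overlap with the target column profile and
    splice the top-k into a rule-generation prompt."""
    target = set(column_profile.lower().split())
    ranked = sorted(
        examples,
        key=lambda ex: len(target & set(ex["profile"].lower().split())),
        reverse=True,
    )
    shots = "\n\n".join(
        f"Profile: {ex['profile']}\nRule: {ex['rule']}"
        for ex in ranked[:k]
    )
    return (
        "Write a data-quality rule for the column below.\n\n"
        f"{shots}\n\nProfile: {column_profile}\nRule:"
    )
```

In a production system the overlap ranking would typically be replaced by embedding similarity against an external knowledge source, as the abstract suggests.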