d-DQIVAR: Data-centric Visual Analytics and Reasoning for Data Quality Improvement

📅 2025-07-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing research relies predominantly on batch data preprocessing and neglects data quality improvement (DQI) aimed at machine learning (ML) performance, which can distort data characteristics and leaves little room for domain knowledge. To address this, we propose d-DQIVAR, a visual analytics and reasoning system that combines data-driven and process-driven approaches. The data-driven component covers imputation, outlier detection, deletion, format standardization, duplicate-record removal, and feature selection. The process-driven component evaluates data quality and DQI procedures against data quality dimensions and ML model performance, and applies the Kolmogorov–Smirnov test. The system supports interactive decision-making so that users can apply expert and domain knowledge within the workflow. Case studies, evaluations, and user studies illustrate how the system is used in practice.

📝 Abstract

Approaches to enhancing data quality (DQ) are classified into two main categories: data-driven and process-driven. However, prior research has predominantly utilized batch data preprocessing within the data-driven framework, which often proves insufficient for optimizing machine learning (ML) model performance and frequently leads to distortions in data characteristics. Existing studies have primarily focused on data preprocessing rather than genuine data quality improvement (DQI). In this paper, we introduce d-DQIVAR, a novel visual analytics system designed to facilitate DQI strategies aimed at improving ML model performance. Our system integrates visual analytics techniques that leverage both data-driven and process-driven approaches. Data-driven techniques address DQ issues through imputation, outlier detection, deletion, format standardization, removal of duplicate records, and feature selection. Process-driven strategies evaluate DQ and DQI procedures by considering DQ dimensions and ML model performance and by applying the Kolmogorov-Smirnov test. Through case studies, evaluations, and user studies, we illustrate how our system empowers users to apply expert and domain knowledge effectively within a practical workflow.
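
The abstract does not say exactly how the Kolmogorov-Smirnov test is applied, so the following is only a minimal sketch of one plausible use: measuring how far an imputation step shifts a feature's distribution. The toy column, the helper names `ks_statistic` and `impute`, and the choice of mean versus median fill are all assumptions made for illustration, not part of d-DQIVAR.

```python
from bisect import bisect_right
from statistics import mean, median


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    # The gap between two step functions can only peak at an observed
    # value, so evaluating both CDFs at those points is sufficient.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in set(a) | set(b)
    )


def impute(column, fill_value):
    """Replace missing entries (None) with a constant fill value."""
    return [fill_value if v is None else v for v in column]


# Hypothetical skewed feature with three missing entries (illustration only).
column = [1.0, 1.2, None, 1.1, 9.5, None, 1.3, None, 1.0, 8.7]
observed = [v for v in column if v is not None]

# Compare each imputed column against the values that were actually observed.
for name, fill in (("mean", mean(observed)), ("median", median(observed))):
    d = ks_statistic(observed, impute(column, fill))
    print(f"{name} imputation: KS statistic = {d:.3f}")
```

On this toy column the mean fill yields a larger statistic than the median fill, which is the kind of distortion signal a process-driven evaluation could surface to the user before a DQI step is accepted.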

Problem

Research questions and friction points this paper is trying to address.

Enhance data quality for better ML performance

Combine data-driven and process-driven DQI strategies

Address DQ issues such as missing values and outliers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates visual analytics for data quality improvement

Combines data-driven and process-driven DQI strategies

Leverages expert knowledge via interactive visual interface

🔎 Similar Papers

Towards AI-Augmented Data Quality Management: From Data Quality for AI to AI for Data Quality Management