🤖 AI Summary
Existing research focuses predominantly on batch-oriented data preprocessing rather than data quality improvement (DQI) tailored to machine learning performance, which often distorts data characteristics and leaves little room for domain knowledge. To address this, we propose d-DQIVAR, a novel visual analytics system that combines data-driven and process-driven approaches. The data-driven component covers missing-value imputation, outlier detection, deletion, format standardization, deduplication, and feature selection; the process-driven component evaluates DQ and DQI procedures against DQ dimensions and ML model performance and applies the Kolmogorov–Smirnov test. d-DQIVAR supports interactive decision-making that draws on expert and domain knowledge. Case studies, evaluations, and user studies illustrate how the system fits into a practical workflow.
📝 Abstract
Approaches to enhancing data quality (DQ) fall into two main categories: data-driven and process-driven. Prior research has predominantly relied on batch data preprocessing within the data-driven framework, which is often insufficient for optimizing machine learning (ML) model performance and frequently distorts data characteristics. In other words, existing studies have focused on data preprocessing rather than genuine data quality improvement (DQI). In this paper, we introduce d-DQIVAR, a novel visual analytics system designed to facilitate DQI strategies aimed at improving ML model performance. Our system integrates visual analytics techniques that leverage both data-driven and process-driven approaches. The data-driven techniques address DQ issues through imputation, outlier detection, deletion, format standardization, removal of duplicate records, and feature selection. The process-driven strategies evaluate DQ and DQI procedures by considering DQ dimensions and ML model performance and by applying the Kolmogorov-Smirnov test. Through case studies, evaluations, and user studies, we illustrate how our system empowers users to apply expert and domain knowledge effectively within a practical workflow.
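As a rough illustration of how a Kolmogorov-Smirnov check can flag distortion introduced by a DQI step, the sketch below compares a feature's observed values with the same feature after mean imputation. The abstract does not say how d-DQIVAR applies the test, so this is an assumption rather than the system's implementation; the helper names `ks_statistic` and `mean_impute` and the toy data are hypothetical, and only the test statistic is computed (no p-value).

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    # The maximum of |F_a(x) - F_b(x)| is attained at an observed value,
    # so it is enough to evaluate both empirical CDFs at every data point.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def mean_impute(values):
    """Hypothetical DQI step: replace missing entries (None) with the observed mean."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Toy feature column with half of its entries missing (illustrative data only).
raw = [1.0, 2.0, None, 4.0, None, 9.0, None, None]
observed = [v for v in raw if v is not None]
imputed = mean_impute(raw)

# A larger statistic means the imputed column drifts further from the observed data.
d = ks_statistic(observed, imputed)
print(d)  # 0.25
```

A statistic near 0 suggests the step preserved the feature's distribution, while larger values indicate the kind of distortion the abstract attributes to batch preprocessing. In practice a library routine such as `scipy.stats.ks_2samp` would also report a p-value.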