d-DQIVAR: Data-centric Visual Analytics and Reasoning for Data Quality Improvement

📅 2025-07-16
🤖 AI Summary
Prior work on data quality (DQ) relies mostly on batch data preprocessing, which is often insufficient for optimizing machine learning (ML) model performance and frequently distorts data characteristics. This paper introduces d-DQIVAR, a visual analytics system for data quality improvement (DQI) that combines data-driven and process-driven approaches. The data-driven component covers imputation, outlier detection, deletion, format standardization, removal of duplicate records, and feature selection. The process-driven component evaluates DQ and DQI procedures using DQ dimensions, ML model performance, and the Kolmogorov-Smirnov test. The system lets users apply expert and domain knowledge interactively within a practical workflow, and the authors illustrate it through case studies, evaluations, and user studies.

📝 Abstract
Approaches to enhancing data quality (DQ) are classified into two main categories: data-driven and process-driven. However, prior research has predominantly utilized batch data preprocessing within the data-driven framework, which often proves insufficient for optimizing machine learning (ML) model performance and frequently leads to distortions in data characteristics. Existing studies have primarily focused on data preprocessing rather than genuine data quality improvement (DQI). In this paper, we introduce d-DQIVAR, a novel visual analytics system designed to facilitate DQI strategies aimed at improving ML model performance. Our system integrates visual analytics techniques that leverage both data-driven and process-driven approaches. Data-driven techniques address DQ issues through imputation, outlier detection, deletion, format standardization, removal of duplicate records, and feature selection. Process-driven strategies evaluate DQ and DQI procedures by considering DQ dimensions and ML model performance and by applying the Kolmogorov-Smirnov test. Through case studies, evaluations, and user studies, we illustrate how our system empowers users to apply expert and domain knowledge effectively within a practical workflow.
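The abstract names the Kolmogorov-Smirnov test as one way to evaluate a DQI procedure. The sketch below shows one plausible use, under assumptions that are not taken from the paper: it compares a column's observed values with the same column after mean imputation, using the two-sample K-S statistic (the largest gap between the two empirical CDFs) as a rough indicator of how much the step shifted the distribution. The function name `ks_statistic`, the toy column `raw`, and the choice of mean imputation are illustrative only. In practice, `scipy.stats.ks_2samp` computes the same statistic and also returns a p-value.

```python
from bisect import bisect_right
from statistics import mean


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # Both ECDFs are step functions, so the largest gap occurs at an observed value.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Hypothetical column with missing entries (None); the values are illustrative.
raw = [1.0, 2.0, None, 4.0, None, 6.0, 7.0, None, 9.0, 10.0]
observed = [v for v in raw if v is not None]

# Candidate DQI step (an assumption for this sketch): mean imputation.
fill = mean(observed)
imputed = [fill if v is None else v for v in raw]

# A larger statistic means the imputed column departs more from the observed one.
d_stat = ks_statistic(observed, imputed)
print(f"K-S statistic after mean imputation: {d_stat:.3f}")
```

A statistic near 0 suggests the step left the distribution largely intact, while a value near 1 suggests a strong shift. Where to set a threshold is a judgment call this sketch does not address.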
Problem

Research questions and friction points this paper is trying to address.

Enhance data quality for better ML performance
Combine data-driven and process-driven DQI strategies
Address DQ issues with techniques such as imputation and outlier detection
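To make the data-driven techniques listed above concrete, here is a minimal sketch of three of them: duplicate removal, imputation, and outlier detection. It is not the d-DQIVAR implementation. The helper names, the toy `rows` data, median imputation, and the 1.5 x IQR outlier rule are all assumptions chosen for illustration.

```python
from statistics import median, quantiles


def drop_duplicates(rows):
    """Remove exact duplicate records, keeping first occurrences in order."""
    seen, unique = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


# Hypothetical (id, score) records: one exact duplicate, one missing score,
# and one implausibly large score.
rows = [("a", 10.0), ("b", 12.0), ("b", 12.0), ("c", None),
        ("d", 11.0), ("e", 13.0), ("f", 95.0)]

unique_rows = drop_duplicates(rows)
scores = impute_median([score for _, score in unique_rows])
flagged = iqr_outliers(scores)
print(len(unique_rows), scores, flagged)
```

Each step here runs as an unattended batch operation. According to the abstract, this is the mode the system aims to complement with process-driven evaluation and expert input.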
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates visual analytics for data quality improvement
Combines data-driven and process-driven DQI strategies
Leverages expert knowledge via interactive visual interface
Authors

Hyein Hong
Sejong University, Seoul, South Korea
Sangbong Yoo
Korea Institute of Science and Technology (KIST)
Data Visualization, Visual Analytics, Volume Rendering
SeokHwan Choi
Sejong University, Seoul, South Korea
Jisue Kim
Wavebridge, South Korea
Seongbum Seo
Sejong University
Data Visualization, Natural Language Processing
Haneol Cho
Sejong University, Seoul, South Korea
Chansoo Kim
AI, Information and Reasoning (AI/R) Laboratory, Korea Institute of Science and Technology (KIST), Seoul, South Korea
Yun Jang
Sejong University
Visualization