🤖 AI Summary
To address the low efficiency and poor scalability of data quality assessment on large-scale datasets, this paper proposes the Chunked Data Shapley framework, a novel approach that integrates Shapley values with data chunking, optimized subset sampling, and single-iteration stochastic gradient descent to enable high-accuracy, high-efficiency quality evaluation. By partitioning data into chunks and estimating their marginal contributions, the framework rapidly identifies high-value or low-quality regions, supporting both classification and regression tasks. Experiments across multiple real-world tabular datasets demonstrate that our method achieves 80×–2300× speedup over state-of-the-art baselines while significantly improving the accuracy of low-quality sample identification. The framework thus establishes a scalable, theoretically grounded, and interpretable paradigm for data quality assessment in large-scale machine learning data governance.
📄 Abstract
As the volume and diversity of available datasets continue to increase, assessing data quality has become crucial for reliable and efficient Machine Learning analytics. A modern, game-theoretic approach for evaluating data quality is the notion of Data Shapley, which quantifies the value of individual data points within a dataset. However, even state-of-the-art methods for scaling the NP-hard Shapley computation face severe challenges when applied to large-scale datasets, limiting their practical use. In this work, we present a Data Shapley approach to identify a dataset's high-quality data tuples, Chunked Data Shapley (C-DaSh). C-DaSh scalably divides the dataset into manageable chunks and estimates the contribution of each chunk using optimized subset selection and single-iteration stochastic gradient descent. This approach drastically reduces computation time while preserving high-quality results. We empirically benchmark our method on diverse real-world classification and regression tasks, demonstrating that C-DaSh outperforms existing Shapley approximations in both computational efficiency (achieving speedups between 80× and 2300×) and accuracy in detecting low-quality data regions. Our method enables practical measurement of dataset quality on large tabular datasets, supporting both classification and regression pipelines.
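To make the chunk-level idea concrete, the following is a minimal, self-contained sketch of Shapley value estimation over data chunks rather than individual points. It is not the paper's C-DaSh algorithm: it uses plain Monte Carlo permutation sampling instead of the paper's optimized subset selection, and all names, sizes, and the logistic-regression utility (trained with a single SGD pass, mirroring the single-iteration idea) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary-classification data split into chunks; chunk 0's labels
# are flipped to simulate a low-quality data region.
n, d, n_chunks = 600, 5, 6
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)
chunks = np.array_split(np.arange(n), n_chunks)
y[chunks[0]] = 1 - y[chunks[0]]  # corrupt chunk 0

# Separate held-out validation set used by the utility function.
X_val = rng.normal(size=(200, d))
y_val = (X_val @ w_true > 0).astype(float)

def utility(idx):
    """Validation accuracy of a logistic model fit with ONE SGD pass
    over the given training indices (single-iteration utility)."""
    w = np.zeros(d)
    for i in rng.permutation(idx):
        p = 1.0 / (1.0 + np.exp(-X[i] @ w))
        w -= 0.1 * (p - y[i]) * X[i]  # gradient step on log-loss
    return ((X_val @ w > 0) == y_val).mean()

# Monte Carlo permutation sampling over CHUNKS: each chunk's Shapley value
# is the average marginal gain in utility when the chunk is appended.
T = 50
phi = np.zeros(n_chunks)
for _ in range(T):
    perm = rng.permutation(n_chunks)
    idx = np.array([], dtype=int)
    prev = utility(idx)
    for c in perm:
        idx = np.concatenate([idx, chunks[c]])
        cur = utility(idx)
        phi[c] += cur - prev
        prev = cur
phi /= T

print(phi)  # the corrupted chunk should receive the lowest value
```

Because values are estimated per chunk, only `n_chunks` utility evaluations are needed per sampled permutation instead of one per data point, which is the source of the speedup the abstract describes.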