🤖 AI Summary
To address the low efficiency and poor scalability of data quality assessment on large-scale datasets, this paper proposes the Chunked Data Shapley framework, a novel approach that integrates Shapley values with data chunking, optimized subset sampling, and single-iteration stochastic gradient descent to enable high-accuracy, high-efficiency quality evaluation. By partitioning data into chunks and estimating their marginal contributions, the framework rapidly identifies high-value or low-quality regions, supporting both classification and regression tasks. Experiments across multiple real-world tabular datasets demonstrate that our method achieves 80×–2300× speedup over state-of-the-art baselines while significantly improving the accuracy of low-quality sample identification. The framework thus establishes a scalable, theoretically grounded, and interpretable paradigm for data quality assessment in large-scale machine learning data governance.
📄 Abstract
As the volume and diversity of available datasets continue to increase, assessing data quality has become crucial for reliable and efficient Machine Learning analytics. A modern, game-theoretic approach for evaluating data quality is the notion of Data Shapley, which quantifies the value of individual data points within a dataset. However, even state-of-the-art methods for scaling the NP-hard Shapley computation face severe challenges when applied to large-scale datasets, limiting their practical use. In this work, we present a Data Shapley approach to identify a dataset's high-quality data tuples, Chunked Data Shapley (C-DaSh). C-DaSh scalably divides the dataset into manageable chunks and estimates the contribution of each chunk using optimized subset selection and single-iteration stochastic gradient descent. This approach drastically reduces computation time while preserving high-quality results. We empirically benchmark our method on diverse real-world classification and regression tasks, demonstrating that C-DaSh outperforms existing Shapley approximations in both computational efficiency (achieving speedups between 80× and 2300×) and accuracy in detecting low-quality data regions. Our method enables practical measurement of dataset quality on large tabular datasets, supporting both classification and regression pipelines.
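To make the chunk-level idea concrete, the following is a minimal, self-contained sketch of Shapley value estimation over data chunks rather than individual points. It is not the paper's C-DaSh algorithm: it uses plain Monte Carlo permutation sampling instead of the paper's optimized subset selection, and all names, sizes, and the logistic-regression utility (trained with a single SGD pass, mirroring the single-iteration idea) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary-classification data split into chunks; chunk 0's labels
# are flipped to simulate a low-quality data region.
n, d, n_chunks = 600, 5, 6
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)
chunks = np.array_split(np.arange(n), n_chunks)
y[chunks[0]] = 1 - y[chunks[0]]  # corrupt chunk 0

# Separate held-out validation set used by the utility function.
X_val = rng.normal(size=(200, d))
y_val = (X_val @ w_true > 0).astype(float)

def utility(idx):
    """Validation accuracy of a logistic model fit with ONE SGD pass
    over the given training indices (single-iteration utility)."""
    w = np.zeros(d)
    for i in rng.permutation(idx):
        p = 1.0 / (1.0 + np.exp(-X[i] @ w))
        w -= 0.1 * (p - y[i]) * X[i]  # gradient step on log-loss
    return ((X_val @ w > 0) == y_val).mean()

# Monte Carlo permutation sampling over CHUNKS: each chunk's Shapley value
# is the average marginal gain in utility when the chunk is appended.
T = 50
phi = np.zeros(n_chunks)
for _ in range(T):
    perm = rng.permutation(n_chunks)
    idx = np.array([], dtype=int)
    prev = utility(idx)
    for c in perm:
        idx = np.concatenate([idx, chunks[c]])
        cur = utility(idx)
        phi[c] += cur - prev
        prev = cur
phi /= T

print(phi)  # the corrupted chunk should receive the lowest value
```

Because values are estimated per chunk, only `n_chunks` utility evaluations are needed per sampled permutation instead of one per data point, which is the source of the speedup the abstract describes.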