DUPRE: Data Utility Prediction for Efficient Data Valuation

📅 2025-02-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high computational cost of repeated model retraining in data valuation, this paper proposes a retraining-free, efficient Shapley value estimation method. The core idea is to employ Gaussian Process Regression (GPR) to directly predict the utility (e.g., validation accuracy) of arbitrary data subsets, thereby bypassing exhaustive model training over exponentially many subsets. A novel GPR kernel is introduced, based on the sliced Wasserstein distance, which simultaneously ensures positive definiteness and captures semantic similarity between data distributions—enabling prior-informed utility prediction. Extensive experiments across multiple models, datasets, and utility functions demonstrate that the method significantly reduces prediction error, accelerates data valuation by orders of magnitude, and maintains high fidelity in Shapley value estimation compared to conventional retraining-based approaches.
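The core pipeline described above — measure distances between data subsets, build a kernel from them, and use Gaussian process regression to predict unseen subset utilities — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the projection count, quantile grid, and exponential kernel form are illustrative assumptions, and the paper's actual kernel construction (and its positive semi-definiteness argument) is not reproduced here.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=50, rng=None):
    """Approximate sliced 2-Wasserstein distance between two empirical
    distributions X, Y (arrays of shape n_samples x d) by averaging 1-D
    Wasserstein distances over random projection directions."""
    rng = np.random.default_rng(rng)
    qs = np.linspace(0.0, 1.0, 100)  # quantile grid (illustrative choice)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=X.shape[1])
        theta /= np.linalg.norm(theta)  # random unit direction
        # 1-D W2 between projected samples via matched quantiles
        total += np.mean((np.quantile(X @ theta, qs) - np.quantile(Y @ theta, qs)) ** 2)
    return np.sqrt(total / n_proj)

def sw_kernel(subsets, lengthscale=1.0, rng=0):
    """Gram matrix k(S, S') = exp(-SW(S, S')^2 / (2 * lengthscale^2)).
    A generic exponential-of-distance kernel; the paper designs its own
    SW-based kernel with proven validity."""
    n = len(subsets)
    K = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            d = sliced_wasserstein(subsets[i], subsets[j], rng=rng)
            K[i, j] = K[j, i] = np.exp(-d ** 2 / (2 * lengthscale ** 2))
    return K

def gp_predict(train_subsets, train_utils, test_subsets, noise=1e-4, ell=1.0):
    """GP posterior mean: predict utilities of unseen subsets from the
    evaluated utilities of some subsets, with no model retraining."""
    K = sw_kernel(list(train_subsets) + list(test_subsets), lengthscale=ell)
    n = len(train_subsets)
    Ktt = K[:n, :n] + noise * np.eye(n)   # train-train block (+ jitter)
    Kst = K[n:, :n]                       # test-train cross-covariances
    return Kst @ np.linalg.solve(Ktt, np.asarray(train_utils, dtype=float))
```

Here each "subset" is just its stacked feature matrix, so semantically similar subsets get high kernel values and their measured utilities inform the prediction — the prior-informed behavior the summary describes.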

📝 Abstract
Data valuation is increasingly used in machine learning (ML) to decide the fair compensation for data owners and identify valuable or harmful data for improving ML models. Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility (e.g., validation accuracy) and retraining the ML model for multiple data subsets. While most existing works on efficient estimation of the Shapley values have focused on reducing the number of subsets to evaluate, our framework, DUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining. Specifically, given the evaluated data utilities of some data subsets, DUPRE fits a Gaussian process (GP) regression model to predict the utility of every other data subset. Our key contribution lies in the design of our GP kernel based on the sliced Wasserstein distance between empirical data distributions. In particular, we show that the kernel is valid and positive semi-definite, encodes prior knowledge of similarities between different data subsets, and can be efficiently computed. We empirically verify that DUPRE introduces low prediction error and speeds up data valuation for various ML models, datasets, and utility functions.
Problem

Research questions and friction points this paper is trying to address.

Repeated model retraining makes Shapley-based data valuation computationally expensive
Number of candidate data subsets grows exponentially with dataset size
Existing efficient estimators reduce the number of subsets evaluated, but not the cost per subset
Innovation

Methods, ideas, or system contributions that make the work stand out.

Predicts data utilities with Gaussian process regression instead of retraining
Designs a valid, positive semi-definite GP kernel from the sliced Wasserstein distance
Reduces the cost per subset evaluation, complementing subset-reduction methods
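To see where predicted utilities plug in, here is a standard permutation-sampling Monte Carlo estimator of Data Shapley values. It is a generic sketch, not the paper's estimator: the `utility` callable stands for any subset-to-score map, and under DUPRE's approach it would be a GP prediction rather than a costly model retraining.

```python
import numpy as np

def shapley_mc(n_points, utility, n_perms=200, rng=0):
    """Monte Carlo (permutation-sampling) estimate of Data Shapley values.

    utility: maps a tuple of point indices to a scalar score, e.g. a
    GP-predicted validation accuracy. Each point's Shapley value is its
    average marginal contribution over random orderings of the dataset.
    """
    rng = np.random.default_rng(rng)
    phi = np.zeros(n_points)
    for _ in range(n_perms):
        perm = rng.permutation(n_points)
        prev = utility(())            # utility of the empty subset
        subset = []
        for i in perm:
            subset.append(int(i))
            cur = utility(tuple(sorted(subset)))
            phi[i] += cur - prev      # marginal contribution of point i
            prev = cur
    return phi / n_perms
```

Each estimate requires many subset evaluations, which is exactly why replacing retraining with a cheap GP prediction per subset yields the reported order-of-magnitude speedups.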
Kieu Thao Nguyen Pham
National University of Singapore, Singapore

Rachael Hwee Ling Sim
National University of Singapore, Singapore

Quoc Phong Nguyen
A2I2 - Deakin University
machine learning, artificial intelligence

See Kiong Ng
National University of Singapore, Singapore

Bryan Kian Hsiang Low
Associate Professor (with tenure), Department of Computer Science, National University of Singapore
Bayesian Optimization, Gaussian Processes, Federated Learning, Data-centric AI, Data Valuation