🤖 AI Summary
Data valuation methods in machine learning exhibit systematic bias and instability: common preprocessing steps substantially perturb valuation outcomes, valuation-based subsampling can worsen class imbalance, and data from underrepresented groups is systematically undervalued, posing both technical and ethical risks. This paper presents a systematic empirical analysis of the fragility of six mainstream valuation methods under preprocessing, subsampling, and group distribution shifts, conducted across nine tabular classification datasets. To address these issues, the authors propose DValCards, a standardized transparency framework for disclosing a valuation method's applicability boundaries, sensitivity profile, and group-level fairness behavior. By enabling rigorous, auditable evaluation of valuation methods, DValCards aims to reduce misuse of data valuation metrics, including in data pricing, and to build trust in responsible ML systems and data markets.
📝 Abstract
Following the rise in popularity of data-centric machine learning (ML), various data valuation methods have been proposed to quantify the contribution of each datapoint to desired ML model performance metrics (e.g., accuracy). Beyond the technical applications of data valuation methods (e.g., data cleaning, data acquisition), it has been suggested that within the context of data markets, data buyers might utilize such methods to fairly compensate data owners. Here we demonstrate that data valuation metrics are inherently biased and unstable under simple algorithmic design choices, resulting in both technical and ethical implications. By analyzing 9 tabular classification datasets and 6 data valuation methods, we illustrate how (1) common and inexpensive data pre-processing techniques can drastically alter estimated data values; (2) subsampling via data valuation metrics may increase class imbalance; and (3) data valuation metrics may undervalue underrepresented group data. Consequently, we argue in favor of increased transparency associated with data valuation in the wild and introduce the novel Data Valuation Cards (DValCards) framework towards this aim. The proliferation of DValCards will reduce misuse of data valuation metrics, including in data pricing, and build trust in responsible ML systems.
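The abstract's first finding, that cheap preprocessing can drastically alter estimated data values, can be illustrated with a minimal sketch. This is not the paper's code: it uses a hypothetical toy setup with leave-one-out (LOO) valuation (the value of a point is the validation log-loss increase when that point is removed from training) and compares the valuation computed on raw features against the same valuation after standard feature scaling. The dataset, model, and scaling choice are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from scipy.stats import spearmanr

def loo_values(X_train, y_train, X_val, y_val):
    """Leave-one-out value of point i = validation log-loss increase
    when i is removed from the training set."""
    base = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    base_loss = log_loss(y_val, base.predict_proba(X_val))
    values = np.empty(len(X_train))
    for i in range(len(X_train)):
        mask = np.arange(len(X_train)) != i
        m = LogisticRegression(max_iter=1000).fit(X_train[mask], y_train[mask])
        values[i] = log_loss(y_val, m.predict_proba(X_val)) - base_loss
    return values

X, y = make_classification(n_samples=120, n_features=5, random_state=0)
X[:, 0] *= 100.0  # exaggerate feature-scale differences so preprocessing matters
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.33, random_state=0)

# Same valuation method, same data, only the preprocessing differs.
raw_vals = loo_values(X_tr, y_tr, X_va, y_va)
scaler = StandardScaler().fit(X_tr)
scaled_vals = loo_values(scaler.transform(X_tr), y_tr, scaler.transform(X_va), y_va)

# Rank correlation between the two valuations; values well below 1.0 mean
# the preprocessing step reordered which datapoints look "valuable".
rho, _ = spearmanr(raw_vals, scaled_vals)
print(f"Spearman rank correlation, raw vs. scaled valuation: {rho:.2f}")
```

If a data market priced datapoints by such values, the ranking (and hence compensation) would depend on a preprocessing choice the data owners never see, which is the kind of instability DValCards is meant to disclose.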