Efficiently Verifiable Proofs of Data Attribution

📅 2025-08-14
🤖 AI Summary
Problem: Resource-constrained parties have little basis for trusting data attribution results supplied by computationally powerful entities, especially when those results feed critical downstream tasks such as data pricing. Method: We propose an interactive proof framework for verifying data attribution, introducing PAC (Probably Approximately Correct) verification to this domain. An untrusted Prover computes the attributions (for example via empirical influence estimation or datamodeling) and then engages a resource-constrained Verifier in an efficient protocol that applies to any linear function over the Boolean hypercube. Contribution: The Verifier's cost, measured in independent model retrainings, is independent of dataset size and scales only as O(1/ε) in the accuracy parameter ε. If both parties follow the protocol, the Verifier accepts attributions that are ε-close to optimal in mean squared error with probability 1−δ; if the Prover deviates, the deviation is detected, or the Verifier still ends up with ε-close attributions, except with probability δ. These completeness, soundness, and efficiency guarantees are formal, and the Verifier never has to retrain a number of models that grows with the dataset. The framework makes verification of attribution outcomes scalable, statistically rigorous, and computationally lightweight, narrowing the trust gap between data owners and service providers in data markets.

📝 Abstract
Data attribution methods aim to answer useful counterfactual questions like "what would an ML model's prediction be if it were trained on a different dataset?" However, estimation of data attribution models through techniques like empirical influence or "datamodeling" remains very computationally expensive. This causes a critical trust issue: if only a few computationally rich parties can obtain data attributions, how can resource-constrained parties trust that the provided attributions are indeed "good," especially when they are used for important downstream applications (e.g., data pricing)? In this paper, we address this trust issue by proposing an interactive verification paradigm for data attribution. An untrusted and computationally powerful Prover learns data attributions, and then engages in an interactive proof with a resource-constrained Verifier. Our main result is a protocol that provides formal completeness, soundness, and efficiency guarantees in the sense of Probably-Approximately-Correct (PAC) verification. Specifically, if both Prover and Verifier follow the protocol, the Verifier accepts data attributions that are ε-close to the optimal data attributions (in terms of the Mean Squared Error) with probability 1−δ. Conversely, if the Prover arbitrarily deviates from the protocol, even with infinite compute, then this is detected (or the protocol still yields ε-close data attributions to the Verifier) except with probability δ. Importantly, our protocol ensures that the Verifier's workload, measured by the number of independent model retrainings it must perform, scales only as O(1/ε), i.e., independently of the dataset size. At a technical level, our results apply to efficiently verifying any linear function over the Boolean hypercube computed by the Prover, making them broadly applicable to various attribution tasks.
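
The completeness and soundness guarantees described in the abstract can be written out as follows. This is only a sketch in hypothetical notation (f, D, θ, L are not taken from the paper), and it assumes that "ε-close" means additive closeness in mean squared error, which the abstract does not spell out.

```latex
% Hypothetical notation, not taken from the paper:
%   S \in \{0,1\}^n   indicator vector of the training subset (a point of the Boolean hypercube)
%   f(S)              outcome of a model retrained on subset S
%   \mathcal{D}       distribution over subsets
%   \hat\theta        attributions the Verifier ends up with
\[
  L(\theta) = \mathbb{E}_{S \sim \mathcal{D}}\!\left[\big(f(S) - \langle \theta, S \rangle\big)^2\right],
  \qquad
  \theta^\star \in \arg\min_{\theta} L(\theta).
\]
% Completeness: Prover and Verifier both follow the protocol.
\[
  \Pr\!\left[\, V \text{ accepts and } L(\hat\theta) \le L(\theta^\star) + \varepsilon \,\right] \ge 1 - \delta .
\]
% Soundness: any Prover strategy, even with unbounded compute.
\[
  \Pr\!\left[\, V \text{ accepts and } L(\hat\theta) > L(\theta^\star) + \varepsilon \,\right] \le \delta .
\]
```

Read this way, soundness does not require every deviation to be rejected: a deviating Prover may still be accepted, but only when the resulting attributions are ε-close anyway.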

Problem

Research questions and friction points this paper is trying to address.

Efficiently verify data attribution in ML models
Address trust issues in computationally expensive attributions
Provide interactive proof for resource-constrained verifiers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Interactive verification for data attribution
PAC guarantees for attribution accuracy
Efficient Verifier workload O(1/ε)
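
To make the idea of a Verifier whose workload is counted in retrainings more concrete, here is a minimal Python sketch of a naive spot check. It is not the paper's protocol: it only estimates the mean squared error of a claimed linear attribution on a few random training subsets, it cannot by itself certify closeness to the optimal attributions, and nothing here shows that it achieves the O(1/ε) rate. All names (`sample_mask`, `linear_prediction`, `estimate_mse`, `toy_retrain_and_eval`) are hypothetical, and the "retraining" is a toy stand-in that happens to be exactly linear.

```python
import random


def sample_mask(n, rng, keep_prob=0.5):
    """Draw a point of the Boolean hypercube {0,1}^n.

    Bit i says whether training example i is kept in the subset.
    """
    return [1 if rng.random() < keep_prob else 0 for _ in range(n)]


def linear_prediction(weights, bias, mask):
    """Evaluate the claimed linear attribution model on one subset."""
    return bias + sum(w for w, bit in zip(weights, mask) if bit)


def estimate_mse(weights, bias, retrain_and_eval, num_samples, seed=0):
    """Monte Carlo estimate of the claimed model's mean squared error.

    Each loop iteration costs one call to retrain_and_eval, i.e. one
    retraining. num_samples is chosen by the caller and does not depend
    on the dataset size n.
    """
    rng = random.Random(seed)
    n = len(weights)
    total = 0.0
    for _ in range(num_samples):
        mask = sample_mask(n, rng)
        err = linear_prediction(weights, bias, mask) - retrain_and_eval(mask)
        total += err * err
    return total / num_samples


# Toy stand-in for "retrain on the subset and measure the output".
# ASSUMPTION: the true behaviour is exactly linear in the mask.
TRUE_WEIGHTS = [float(i % 7) for i in range(1000)]


def toy_retrain_and_eval(mask):
    return linear_prediction(TRUE_WEIGHTS, 0.0, mask)


# An honest claim reproduces the toy model; a shifted claim does not.
honest_mse = estimate_mse(TRUE_WEIGHTS, 0.0, toy_retrain_and_eval, num_samples=20)
shifted_weights = [w + 1.0 for w in TRUE_WEIGHTS]
dishonest_mse = estimate_mse(shifted_weights, 0.0, toy_retrain_and_eval, num_samples=20)
```

In this toy setting the honest claim yields zero estimated error and the shifted claim a strictly positive one, using 20 retrainings for a dataset of 1000 examples. A real Verifier would replace `toy_retrain_and_eval` with actual model training, which is where the cost lies.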