Integrated Influence: Data Attribution with Baseline

📅 2025-08-07
🤖 AI Summary
Existing data attribution methods based on leave-one-out (LOO) perturb only a single training sample at a time, so they overlook collective influence among samples. Many of them also lack a baseline, which limits their ability to give counterfactual explanations. This paper proposes Integrated Influence, a data attribution method that incorporates a baseline dataset and a data degeneration path. It defines a baseline dataset, transitions the current dataset to that baseline through a degeneration process, and accumulates each sample's influence along the way, in a manner reminiscent of integrated gradients. The authors provide a theoretical framework and show that popular methods such as influence functions are special cases of the approach. Experiments show that Integrated Influence produces more reliable attributions than existing methods on both the data attribution task and the mislabelled example identification task.

📝 Abstract
As an effective approach to quantifying how training samples influence a test sample, data attribution is crucial for understanding data and models and for further enhancing the transparency of machine learning models. We find that prevailing data attribution methods based on the leave-one-out (LOO) strategy yield only local explanations: they perturb a single training sample and overlook the collective influence within the training set. In addition, the lack of a baseline in many data attribution methods reduces the flexibility of the explanation, e.g., they fail to provide counterfactual explanations. In this paper, we propose Integrated Influence, a novel data attribution method that incorporates a baseline approach. Our method defines a baseline dataset, follows a data degeneration process to transition the current dataset to the baseline, and accumulates the influence of each sample throughout this process. We provide a solid theoretical framework for our method and further demonstrate that popular methods, such as influence functions, can be viewed as special cases of our approach. Experimental results show that Integrated Influence generates more reliable data attributions than existing methods in both the data attribution task and the mislabelled example identification task.
Problem

Research questions and friction points this paper is trying to address.

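Overcoming the locality of LOO-based data attribution, which perturbs a single training sample and overlooks collective influence
Addressing the lack of a baseline, which limits explanation flexibility (e.g., no counterfactual explanations)
Improving the reliability of data attribution and mislabelled example identification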
Innovation

Methods, ideas, or system contributions that make the work stand out.

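Incorporates a baseline dataset into data attribution
Uses a data degeneration process from the current dataset to the baseline to accumulate each sample's influence
Provides a theoretical framework in which popular methods such as influence functions are special cases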