Accumulative SGD Influence Estimation for Data Attribution

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Accurate sample-level influence estimation is critical for modern data-centric AI, yet existing SGD-based influence estimation (SGD-IE) methods accumulate proxy gradients epoch-wise, ignoring cross-epoch compounding effects and yielding biased rankings of critical samples. To address this, we propose Trajectory-Aware ACC-SGD-IE, an influence estimation framework that, for the first time, propagates leave-one-out perturbations across epochs. Under strong convexity it achieves geometric error convergence; under non-convexity it attains a tighter error bound. By integrating cumulative influence-state updates with per-step gradient proxy computation, ACC-SGD-IE supports large mini-batch optimization. Experiments on Adult, 20 Newsgroups, and MNIST demonstrate that ACC-SGD-IE significantly outperforms SGD-IE in long-training and noisy-data regimes, leading to more effective data cleaning and improved downstream model performance.

📝 Abstract
Modern data-centric AI needs precise per-sample influence. Standard SGD-IE approximates leave-one-out effects by summing per-epoch surrogates and ignores cross-epoch compounding, which misranks critical examples. We propose ACC-SGD-IE, a trajectory-aware estimator that propagates the leave-one-out perturbation across training and updates an accumulative influence state at each step. In smooth strongly convex settings it achieves geometric error contraction and, in smooth non-convex regimes, it tightens error bounds; larger mini-batches further reduce the constants. Empirically, on Adult, 20 Newsgroups, and MNIST, under both clean and corrupted data and both convex and non-convex training, ACC-SGD-IE yields more accurate influence estimates, especially over long training runs. For downstream data cleansing it flags noisy samples more reliably, so models trained on data cleaned by ACC-SGD-IE outperform those trained on data cleaned by SGD-IE.
Problem

Research questions and friction points this paper is trying to address.

Estimating precise per-sample influence in data-centric AI systems
Addressing misranking of critical examples in SGD influence estimation
Improving data cleansing by reliably identifying noisy training samples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Propagates leave-one-out perturbation across training
Updates accumulative influence state each step
Achieves geometric error contraction in smooth strongly convex settings
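The core idea above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it uses full-batch gradient descent on a ridge-regularized linear model (the paper targets mini-batch SGD), and all names (`acc_influence_state`, `z_idx`, `lam`) are illustrative. At each step the accumulated leave-one-out perturbation state is shrunk through the linearized update map (I - lr * H) and then receives the current step's contribution from the held-out sample, which is exactly the cross-step propagation that per-epoch summing ignores.

```python
import numpy as np

def acc_influence_state(X, y, z_idx, lr=0.1, epochs=50, lam=1e-2):
    """Sketch: propagate the leave-one-out perturbation for sample z_idx
    along the training trajectory of a ridge-regularized linear model.
    Returns the final parameters and the accumulated perturbation u,
    which approximates (theta trained without z) - (theta trained with z)."""
    n, d = X.shape
    theta = np.zeros(d)
    u = np.zeros(d)  # accumulated influence state
    I = np.eye(d)
    for _ in range(epochs):
        resid = X @ theta - y
        grad = X.T @ resid / n + lam * theta          # full-batch gradient
        H = X.T @ X / n + lam * I                     # Hessian of the quadratic loss
        g_z = (X[z_idx] @ theta - y[z_idx]) * X[z_idx]  # per-sample gradient of z
        # Cross-step propagation: carry the old state through the
        # linearized dynamics, then add this step's contribution of z.
        u = (I - lr * H) @ u + (lr / n) * g_z
        theta -= lr * grad
    return theta, u
```

The influence of sample z on a test loss is then approximated by the inner product of `u` with the test-loss gradient at the final parameters. Summing `(lr / n) * g_z` per epoch without the `(I - lr * H)` contraction would ignore how earlier perturbations are reshaped by later updates, which is the bias the accumulative state corrects.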
Yunxiao Shi
University of Technology Sydney, Sydney, NSW, Australia
Shuo Yang
University of Technology Sydney, Sydney, NSW, Australia
Yixin Su
Huazhong University of Science and Technology
Large Language Models, Personalization, Recommender Systems, Graph Neural Networks
Rui Zhang
Wuhan, Hubei, China
Min Xu
University of Technology Sydney, Sydney, NSW, Australia