Accumulative SGD Influence Estimation for Data Attribution

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Accurate sample-level influence estimation is critical for modern data-centric AI, yet existing SGD-based influence estimation (SGD-IE) methods accumulate proxy gradients epoch-wise, ignoring cross-epoch compounding effects and yielding biased rankings of critical samples. To address this, we propose Trajectory-Aware ACC-SGD-IE, an influence estimation framework that, for the first time, propagates leave-one-out perturbations across epochs. Under strong convexity it achieves geometric error convergence; under non-convexity it attains a tighter error bound. By integrating cumulative influence-state updates with per-step gradient proxy computation, ACC-SGD-IE supports large mini-batch optimization. Experiments on Adult, 20 Newsgroups, and MNIST demonstrate that ACC-SGD-IE significantly outperforms SGD-IE in long-training and noisy-data regimes, leading to more effective data cleaning and improved downstream model performance.

📝 Abstract
Modern data-centric AI needs precise per-sample influence. Standard SGD-IE approximates leave-one-out effects by summing per-epoch surrogates and ignores cross-epoch compounding, which misranks critical examples. We propose ACC-SGD-IE, a trajectory-aware estimator that propagates the leave-one-out perturbation across training and updates an accumulative influence state at each step. In smooth strongly convex settings it achieves geometric error contraction and, in smooth non-convex regimes, it tightens error bounds; larger mini-batches further reduce the constants. Empirically, on Adult, 20 Newsgroups, and MNIST, under both clean and corrupted data and both convex and non-convex training, ACC-SGD-IE yields more accurate influence estimates, especially over long training runs. For downstream data cleansing it flags noisy samples more reliably, so models trained on data cleaned by ACC-SGD-IE outperform those trained on data cleaned by SGD-IE.
Problem

Research questions and friction points this paper is trying to address.

Estimating precise per-sample influence in data-centric AI systems
Addressing misranking of critical examples in SGD influence estimation
Improving data cleansing by reliably identifying noisy training samples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Propagates leave-one-out perturbation across training
Updates accumulative influence state each step
Achieves geometric error contraction in smooth strongly convex settings
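The core idea above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it uses full-batch gradient descent on a ridge-regularized linear model (the paper targets mini-batch SGD), and all names (`acc_influence_state`, `z_idx`, `lam`) are illustrative. At each step the accumulated leave-one-out perturbation state is shrunk through the linearized update map (I - lr * H) and then receives the current step's contribution from the held-out sample, which is exactly the cross-step propagation that per-epoch summing ignores.

```python
import numpy as np

def acc_influence_state(X, y, z_idx, lr=0.1, epochs=50, lam=1e-2):
    """Sketch: propagate the leave-one-out perturbation for sample z_idx
    along the training trajectory of a ridge-regularized linear model.
    Returns the final parameters and the accumulated perturbation u,
    which approximates (theta trained without z) - (theta trained with z)."""
    n, d = X.shape
    theta = np.zeros(d)
    u = np.zeros(d)  # accumulated influence state
    I = np.eye(d)
    for _ in range(epochs):
        resid = X @ theta - y
        grad = X.T @ resid / n + lam * theta          # full-batch gradient
        H = X.T @ X / n + lam * I                     # Hessian of the quadratic loss
        g_z = (X[z_idx] @ theta - y[z_idx]) * X[z_idx]  # per-sample gradient of z
        # Cross-step propagation: carry the old state through the
        # linearized dynamics, then add this step's contribution of z.
        u = (I - lr * H) @ u + (lr / n) * g_z
        theta -= lr * grad
    return theta, u
```

The influence of sample z on a test loss is then approximated by the inner product of `u` with the test-loss gradient at the final parameters. Summing `(lr / n) * g_z` per epoch without the `(I - lr * H)` contraction would ignore how earlier perturbations are reshaped by later updates, which is the bias the accumulative state corrects.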
Yunxiao Shi
University of Technology Sydney, Sydney, NSW, Australia
Shuo Yang
University of Technology Sydney, Sydney, NSW, Australia
Yixin Su
Huazhong University of Science and Technology
Large Language Models, Personalization, Recommender Systems, Graph Neural Networks
Rui Zhang
Wuhan, Hubei, China
Min Xu
University of Technology Sydney, Sydney, NSW, Australia