🤖 AI Summary
Accurate sample-level influence estimation is critical for modern data-centric AI, yet existing SGD-based influence estimation (SGD-IE) methods accumulate proxy gradients epoch by epoch, ignoring cross-epoch compounding effects and yielding biased rankings of critical samples. To address this, we propose Trajectory-Aware ACC-SGD-IE, an influence estimation framework that, to the authors' knowledge, is the first to propagate leave-one-out perturbations across epochs. Under strong convexity it achieves geometric error contraction; under non-convexity it attains a tighter error bound. By integrating cumulative influence-state updates with per-step gradient proxy computation, ACC-SGD-IE supports large mini-batch optimization. Experiments on Adult, 20 Newsgroups, and MNIST demonstrate that ACC-SGD-IE significantly outperforms SGD-IE in long-training and noisy-data regimes, leading to more effective data cleaning and stronger downstream model performance.
📝 Abstract
Modern data-centric AI needs precise per-sample influence. Standard SGD-IE approximates leave-one-out effects by summing per-epoch surrogates and ignores cross-epoch compounding, which misranks critical examples. We propose ACC-SGD-IE, a trajectory-aware estimator that propagates the leave-one-out perturbation across training and updates an accumulative influence state at each step. In smooth strongly convex settings it achieves geometric error contraction and, in smooth non-convex regimes, it tightens error bounds; larger mini-batches further reduce the constants. Empirically, on Adult, 20 Newsgroups, and MNIST, under clean and corrupted data and both convex and non-convex training, ACC-SGD-IE yields more accurate influence estimates, especially over long training runs. For downstream data cleansing it flags noisy samples more reliably, and models trained on data cleaned by ACC-SGD-IE outperform models trained on data cleaned by SGD-IE.
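The abstract's core mechanism, propagating the leave-one-out perturbation through every SGD step rather than summing per-epoch surrogates, can be sketched on a toy problem. This is our own minimal illustration, not the paper's implementation: it uses linear regression with squared loss (so the mini-batch Hessian is exact and cheap), a standard perturbation recurrence u ← (I − ηH)u + (η/|B|)∇ℓⱼ whenever the tracked sample j appears in the batch, and hypothetical names throughout.

```python
import numpy as np

# Toy setup (all values illustrative): linear regression with squared loss.
rng = np.random.default_rng(0)
n, d = 64, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

j = 0            # index of the sample whose leave-one-out influence we track
eta = 0.1        # step size
batch_size = 8
epochs = 20

theta = np.zeros(d)   # model parameters
u = np.zeros(d)       # accumulative influence state: approx. theta_without_j - theta

for epoch in range(epochs):
    order = rng.permutation(n)
    for start in range(0, n, batch_size):
        B = order[start:start + batch_size]
        Xb, yb = X[B], y[B]
        grad = Xb.T @ (Xb @ theta - yb) / len(B)
        # Propagate the perturbation through this step: u <- (I - eta*H_B) u,
        # where H_B = Xb.T @ Xb / |B| is the exact mini-batch Hessian here.
        # Crucially, u carries over across epochs instead of being reset.
        u = u - eta * (Xb.T @ (Xb @ u) / len(B))
        if j in B:
            # Removing sample j deletes its gradient contribution from the step.
            gj = X[j] * (X[j] @ theta - y[j])
            u = u + eta * gj / len(B)
        theta = theta - eta * grad

# Estimated influence of sample j on a test point's loss: grad_test . u
x_test = rng.normal(size=d)
y_test = x_test @ w_true
influence = (x_test @ theta - y_test) * (x_test @ u)
```

For squared loss the Hessian-vector product is a closed-form matrix product; for non-convex models the same recurrence would need Hessian-vector products from automatic differentiation, and the paper's error bounds speak to how this accumulated state behaves in that regime.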