🤖 AI Summary
Existing scalable data attribution methods typically assume that sample utility is additive, which prevents them from capturing subset-level interactions such as redundancy and complementarity. This work proposes GRASP, an attribution framework that models subset interactions while remaining computationally efficient at scale. The approach reformulates attribution as subset-level counterfactual utility prediction and adds a geometry-aware quadratic penalty term, motivated by a theoretical smoothness lower bound, to encode subset interactions explicitly. Efficiency comes from combining low-dimensional feature sketches with a finite lower-confidence-bound selection protocol, which avoids hidden oracle-based hyperparameter tuning. Under subset-retraining evaluation, the method more than doubles task-level rank correlation relative to existing scalable baselines while reducing upfront artifact construction cost by nearly an order of magnitude. Downstream diagnostics indicate that the scoring mechanism also transfers to language model data curation and cross-domain visual data selection.
📝 Abstract
Scalable data attribution methods typically assign isolated utility scores to individual training examples. This prevalent additive assumption fundamentally fails to capture critical subset dynamics, including data redundancy and complementary coverage. In this work, we reframe attribution as subset-level counterfactual utility prediction and introduce GRASP, an interaction-aware surrogate. Grounded in a theoretical smoothness lower bound, GRASP explicitly models subset interactions through a quadratic geometric penalty. To achieve pretraining-scale efficiency without relying on hidden oracle tuning, we couple low-dimensional feature sketches with a strictly finite lower-confidence bound selection protocol. Extensive subset-retraining evaluations demonstrate that GRASP decisively outperforms existing scalable baselines. It more than doubles the task-level rank correlation for counterfactual subset fidelity while reducing upfront artifact construction costs by nearly an order of magnitude. Downstream diagnostics further show that this scoring mechanism transfers to language model curation and cross-domain vision selection, establishing a robust foundation for optimizing massive pretraining corpora.
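To make the contrast between additive scoring and an interaction-aware surrogate concrete, the following is a minimal illustrative sketch, not GRASP itself. The abstract gives no formulas, so everything here is an assumption: the function names `sketch` and `subset_utility`, the Gaussian random projection used as the low-dimensional sketch, the pairwise-similarity form of the quadratic penalty, and the weight `lam`. The lower-confidence-bound selection protocol is not modeled.

```python
import numpy as np

def sketch(features, dim, seed=0):
    """Hypothetical sketch: Gaussian random projection to `dim` dims, rows L2-normalized."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((features.shape[1], dim)) / np.sqrt(dim)
    z = features @ proj
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def subset_utility(scores, z, subset, lam=0.5):
    """Additive per-example scores minus an assumed quadratic penalty on pairwise similarity."""
    idx = np.asarray(subset)
    zs = z[idx]
    gram = zs @ zs.T
    # Sum of similarities over unordered pairs (diagonal self-similarity excluded).
    pair_sim = (gram.sum() - np.trace(gram)) / 2.0
    return scores[idx].sum() - lam * pair_sim

rng = np.random.default_rng(1)
features = rng.standard_normal((3, 64))
features[1] = features[0]      # example 1 is an exact duplicate of example 0
scores = np.ones(3)            # identical isolated scores for every example
z = sketch(features, dim=16)

redundant = subset_utility(scores, z, [0, 1])   # duplicate pair
diverse = subset_utility(scores, z, [0, 2])     # distinct pair
print(redundant, diverse)
```

A purely additive scorer would rate both pairs identically, since all three examples carry the same isolated score. Under this assumed penalty, the duplicate pair has the maximum possible sketch similarity, so its predicted utility cannot exceed that of the distinct pair.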