🤖 AI Summary
This work addresses the high computational cost and reliance on local approximations in existing training data attribution methods for large language models, which typically operate in high-dimensional parameter spaces. The authors propose a novel approach that reformulates the problem as a sparse recovery task in activation space. By learning lightweight “steering operators” that emulate the influence of training subsets on model behavior, they perform a sparse linear decomposition of the perturbations these operators induce on test predictions, enabling efficient inference of individual sample contributions. The method eliminates the need for repeated retraining or gradient tracking, achieving state-of-the-art performance on pre-training attribution tasks with a 13× speedup over prior approaches. It further demonstrates broad applicability to data selection, contamination detection, and qualitative analysis.
📝 Abstract
Training Data Attribution (TDA) seeks to trace a model's predictions back to its training data. The gold standard for TDA relies on causal interventions, observing how a model changes when data is added or removed, but repeated retraining is computationally challenging for Large Language Models (LLMs). Consequently, most approaches approximate this effect in the parameter space using gradients. However, tracking gradients across billions of parameters is not only prohibitively expensive but also relies on local approximations. In this work, we propose a shift: rather than estimating parameter changes, we model the functional effect of training data in the activation space. We introduce STRIDE (Steering-based Training Data Influence Decomposition), a framework that formulates TDA as a sparse recovery problem in the spirit of compressive sensing. STRIDE learns lightweight "steering operators" that mimic the behavioral shift caused by training on data subsets. By measuring how these operators perturb test predictions, we recover individual training example influences via sparse linear decomposition. STRIDE achieves state-of-the-art performance for LLM pre-training attribution while being an order of magnitude ($13\times$) faster than prior methods. We further validate its practical utility through downstream applications including data selection, data contamination detection, and qualitative analysis.
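To make the sparse-recovery framing concrete, the following is a minimal synthetic sketch, not the STRIDE implementation. It assumes (none of this is stated in the abstract) that each subset's measured perturbation of a test prediction is approximately the sum of the influences of the examples it contains, that only a few examples have nonzero influence, and that a plain ISTA solver for the Lasso objective is an acceptable stand-in for the paper's decomposition step. The steering-operator measurements are simulated with random data, and all names (`membership`, `ista`, `lam`) are hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def ista(A, y, lam, n_iters=2000):
    """Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 with iterative soft-thresholding."""
    # Step size 1/L, where L is the squared spectral norm of A.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - y)
        x = soft_threshold(x - step * grad, step * lam)
    return x


rng = np.random.default_rng(0)
n_train, n_subsets, n_influential = 100, 60, 4

# Assumption: random subsets, each containing every training example with prob. 0.5.
# membership[j, i] == 1 means training example i belongs to subset j.
membership = (rng.random((n_subsets, n_train)) < 0.5).astype(float)

# Assumption: only a few training examples truly influence the test prediction.
x_true = np.zeros(n_train)
support = rng.choice(n_train, size=n_influential, replace=False)
x_true[support] = rng.uniform(1.0, 2.0, size=n_influential)

# Simulated measurements: in the paper's setting these would be the changes in a
# test prediction induced by each subset's steering operator; here they are
# synthetic, generated from an additive model plus small noise.
y = membership @ x_true + 0.01 * rng.standard_normal(n_subsets)

# Center columns and measurements to remove the shared offset of a 0/1 design.
A_centered = membership - membership.mean(axis=0)
y_centered = y - y.mean()

x_hat = ista(A_centered, y_centered, lam=0.5)
top = np.argsort(-np.abs(x_hat))[:n_influential]

print("true influential examples:     ", sorted(support.tolist()))
print("recovered influential examples:", sorted(top.tolist()))
```

The point of the sketch is that far fewer subset-level measurements than training examples (60 versus 100 here) can suffice when influence is sparse, which is the compressive-sensing intuition the abstract invokes. How STRIDE actually constructs subsets, measurements, and the solver is specified in the paper, not here.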