Data Attribution in Large Language Models via Bidirectional Gradient Optimization

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of tracing how training data influences the outputs of large language models. It proposes a training data attribution (TDA) framework built on an inverse formulation: rather than asking how a training sample changed the model, it asks how the loss on each training sample would change if the model had seen the generated output during training. The method perturbs the base model with bidirectional gradient optimization (gradient ascent and descent) on a generated text sample, then measures the resulting change in loss across training samples. Because the comparison can be run over any unit of training data, the framework supports attribution at arbitrary granularity and covers both factual and stylistic attribution. On pretrained models with known datasets, the authors report that it outperforms prior baselines on influence metrics, which supports more interpretable and accountable language models.

📝 Abstract

Large Language Models (LLMs) are increasingly deployed across diverse applications, raising critical questions for governance, accountability, and data provenance. Understanding which training data most influenced a model's output remains a fundamental open problem. We address this challenge through training data attribution (TDA) for auto-regressive LLMs by expanding upon the inverse formulation: How would training data be affected if the model had seen the generated output during training? Our method perturbs the base model using bidirectional gradient optimization (gradient ascent and descent) on a generated text sample and measures the resulting change in loss across training samples. Our framework supports attribution at arbitrary data granularity, enabling both factual and stylistic attribution. We evaluate our method against baselines on pretrained models with known datasets, and show that it outperforms previous work on influence metrics, thereby enhancing model interpretability, an essential requirement for accountable AI systems.
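The perturb-and-measure loop described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: a three-token bigram model with a zero-initialized logit table stands in for the pretrained LLM, the helper names (`sequence_loss`, `sequence_grad`, `perturb`, `attribution_scores`) are hypothetical, and the score (loss after ascent minus loss after descent) is one plausible way to combine the two directions, since the abstract does not state the exact formula.

```python
import math

# Toy vocabulary; a real setup would use an LLM tokenizer (assumption).
VOCAB = ["a", "b", "c"]
IDX = {tok: i for i, tok in enumerate(VOCAB)}


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_loss(weights, tokens):
    """Mean next-token cross-entropy of the toy bigram model on one sequence."""
    pairs = list(zip(tokens, tokens[1:]))
    nll = 0.0
    for prev, nxt in pairs:
        probs = softmax(weights[IDX[prev]])
        nll -= math.log(probs[IDX[nxt]])
    return nll / len(pairs)


def sequence_grad(weights, tokens):
    """Gradient of sequence_loss with respect to the logit table."""
    grad = [[0.0] * len(VOCAB) for _ in VOCAB]
    pairs = list(zip(tokens, tokens[1:]))
    for prev, nxt in pairs:
        probs = softmax(weights[IDX[prev]])
        for j, p in enumerate(probs):
            target = 1.0 if j == IDX[nxt] else 0.0
            grad[IDX[prev]][j] += (p - target) / len(pairs)
    return grad


def perturb(weights, tokens, lr, steps, sign):
    """Copy the model and take gradient steps on `tokens`.

    sign=-1 performs gradient descent, sign=+1 performs gradient ascent.
    """
    w = [row[:] for row in weights]
    for _ in range(steps):
        g = sequence_grad(w, tokens)
        for i in range(len(w)):
            for j in range(len(w[i])):
                w[i][j] += sign * lr * g[i][j]
    return w


def attribution_scores(weights, generated, train_samples, lr=0.5, steps=3):
    """Score each training sample by how its loss reacts to both perturbations.

    Assumed score: loss under the ascent model minus loss under the descent
    model. Larger values mean the sample moves with the generated text.
    """
    w_down = perturb(weights, generated, lr, steps, sign=-1)
    w_up = perturb(weights, generated, lr, steps, sign=+1)
    return [sequence_loss(w_up, s) - sequence_loss(w_down, s) for s in train_samples]


# Stand-in "base model": uniform next-token predictions (assumption).
base = [[0.0] * len(VOCAB) for _ in VOCAB]
generated = ["a", "b", "a", "b"]
train = [["a", "b", "a"], ["c", "c", "c"], ["a", "a", "a"]]
scores = attribution_scores(base, generated, train)
for sample, score in zip(train, scores):
    print(" ".join(sample), round(score, 4))
```

In this toy setting, the training sample that shares transitions with the generated text receives a positive score, the sample over unrelated tokens is unaffected, and the sample with conflicting transitions receives a negative score. Because the scores are computed per sample, the same loop could in principle be run over documents, passages, or other units, which is the sense in which the abstract speaks of arbitrary granularity.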

Problem

Research questions and friction points this paper is trying to address.

Data Attribution

Large Language Models

Training Data Influence

Model Interpretability

Data Provenance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training Data Attribution

Bidirectional Gradient Optimization

Large Language Models