🤖 AI Summary
Problem: When large language models (LLMs) are trained under constrained data budgets, existing data valuation methods suffer from prohibitive computational costs, poor scalability, and reliance on expensive second-order curvature estimation.
Method: This paper introduces ALinFiK, presented as the first third-party, scalable, single-sample data valuation framework for LLMs, built upon the Linearized Future Influence Kernel (LinFiK). LinFiK integrates gradient linearization, influence function approximation, and meta-learning-driven kernel estimation, enabling real-time data value quantification for models with tens of millions of parameters. It avoids explicit Hessian computation, relying on Hessian-free second-order optimization and efficient sampling strategies.
Contribution/Results: On multiple LLM benchmarks, ALinFiK significantly outperforms state-of-the-art methods—including TracIn and Data Shapley—in both accuracy and efficiency: it achieves a 47× speedup in evaluation latency and processes over 200 million samples per day on a single GPU. This marks the first practical solution for high-accuracy, high-throughput, dynamic data valuation in large-scale LLM training.
📝 Abstract
Large Language Models (LLMs) rely heavily on high-quality training data, making data valuation crucial for optimizing model performance, especially when working within a limited budget. In this work, we aim to offer a third-party data valuation approach that benefits both data providers and model developers. We introduce a linearized future influence kernel (LinFiK), which assesses the value of individual data samples in improving LLM performance during training. We further propose ALinFiK, a learning strategy to approximate LinFiK, enabling scalable data valuation. Our comprehensive evaluations show that this approach surpasses existing baselines in both effectiveness and efficiency, with significant scalability advantages as LLM parameters increase.
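The abstract does not state the LinFiK formula, so the following is only a generic illustration of what first-order, Hessian-free valuation of a single sample can look like, not the paper's method. It scores a training sample by the dot product between its loss gradient and the mean validation gradient on a toy logistic-regression model. The function names (`logistic_grad`, `first_order_value`) and the toy data are assumptions made for this sketch.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def logistic_grad(w, x, y):
    """Gradient of the logistic loss w.r.t. w for one sample (x, y), y in {0, 1}."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]

def first_order_value(w, train_sample, val_set):
    """Hypothetical score: <grad of train sample, mean grad over validation set>.

    A positive score means that one small gradient step on the training
    sample lowers the validation loss to first order. No Hessian is used.
    """
    x, y = train_sample
    g_train = logistic_grad(w, x, y)
    val_grads = [logistic_grad(w, xv, yv) for xv, yv in val_set]
    g_val = [sum(col) / len(val_grads) for col in zip(*val_grads)]
    return sum(a * b for a, b in zip(g_train, g_val))

# Toy setup (illustrative only): two validation points labelled 1.
w = [0.0, 0.0]
val_set = [([1.0, 0.0], 1.0), ([2.0, 0.0], 1.0)]

helpful = first_order_value(w, ([1.0, 0.0], 1.0), val_set)  # label agrees with validation
harmful = first_order_value(w, ([1.0, 0.0], 0.0), val_set)  # label contradicts validation
```

In this toy case the sample whose label agrees with the validation data receives a positive score and the contradicting sample a negative one. A learned approximation in the spirit of ALinFiK would aim to predict such scores without computing per-sample gradients, though the abstract does not describe how.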