🤖 AI Summary
This paper investigates how multi-pass training reshapes the data scaling laws of large language models (LLMs) in the finite-data regime: specifically, how much larger a dataset must be for single-pass training to match the performance of $K$-pass training on a dataset of size $N$.
Method: Through a theoretical analysis of stochastic gradient descent (SGD) for linear regression, under either strong convexity or Zipf-distributed data assumptions, we introduce the "effective reuse rate" $E(K, N)$ to quantify the marginal gain from repeated passes.
Contribution/Results: We prove that $E(K, N) \approx K$ for small $K$, but that it saturates as $K$ grows; under strong convexity, the saturation level is $\Theta(\log N)$, revealing that dataset size fundamentally caps the marginal benefit of additional passes. Empirical validation on LLMs confirms this scaling behavior, refuting the common assumption of constant per-pass gains. Our framework provides a theoretically grounded, quantifiable basis for optimizing training strategies, especially in low-data settings.
📝 Abstract
While data scaling laws of large language models (LLMs) have been widely examined in the one-pass regime with massive corpora, their form under limited data and repeated epochs remains largely unexplored. This paper presents a theoretical analysis of how a common workaround, training for multiple epochs on the same dataset, reshapes the data scaling laws in linear regression. Concretely, we ask: to match the performance of training on a dataset of size $N$ for $K$ epochs, how much larger must a dataset be if the model is trained for only one pass? We quantify this using the \textit{effective reuse rate} of the data, $E(K, N)$, which we define as the multiplicative factor by which the dataset must grow under one-pass training to achieve the same test loss as $K$-epoch training. Our analysis precisely characterizes the scaling behavior of $E(K, N)$ for SGD in linear regression under either strong convexity or Zipf-distributed data: (1) When $K$ is small, we prove that $E(K, N) \approx K$, indicating that every new epoch yields a linear gain; (2) As $K$ increases, $E(K, N)$ plateaus at a problem-dependent value that grows with $N$ ($\Theta(\log N)$ for the strongly-convex case), implying that larger datasets can be repeated more times before the marginal benefit vanishes. These theoretical findings point out a neglected factor in a recent empirical study (Muennighoff et al., 2023), which claimed that training LLMs for up to $4$ epochs results in negligible loss differences compared to using fresh data at each step, \textit{i.e.}, $E(K, N) \approx K$ for $K \le 4$ in our notation. Supported by further empirical validation with LLMs, our results reveal that the maximum $K$ value for which $E(K, N) \approx K$ in fact depends on the data size and distribution, and underscore the need to explicitly model both factors in future studies of scaling laws with data reuse.
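The definition of the effective reuse rate can be made concrete with a toy simulation. The sketch below (not the paper's experiments; the dimension, noise level, learning rate, and search grid are all illustrative assumptions) trains plain SGD on synthetic linear regression for $K$ epochs on $N$ samples, then searches for the one-pass dataset size $N'$ that matches the resulting test loss, giving $E(K, N) \approx N'/N$:

```python
# Toy estimate of the effective reuse rate E(K, N) for SGD on synthetic
# linear regression. All hyperparameters here are illustrative choices,
# not settings from the paper.
import numpy as np

rng = np.random.default_rng(0)
d = 20                                # feature dimension (assumed)
w_star = rng.normal(size=d)           # ground-truth weights

def make_data(n):
    X = rng.normal(size=(n, d))
    y = X @ w_star + 0.1 * rng.normal(size=n)
    return X, y

def sgd_test_loss(n, epochs, lr=0.002):
    """Run `epochs` SGD passes over n samples; return held-out MSE."""
    X, y = make_data(n)
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):  # one shuffled pass = one epoch
            w -= lr * (X[i] @ w - y[i]) * X[i]
    X_test, y_test = make_data(5000)
    return np.mean((X_test @ w - y_test) ** 2)

def effective_reuse_rate(K, N):
    """Smallest N'/N such that one-pass training on N' fresh samples
    matches the test loss of K-epoch training on N samples."""
    target = sgd_test_loss(N, K)
    for n_prime in range(N, 20 * N + 1, N // 2):
        if sgd_test_loss(n_prime, 1) <= target:
            return n_prime / N
    return None                       # not matched within the search grid

print(effective_reuse_rate(4, 500))   # estimated E(K, N) for K=4, N=500
```

In this under-converged regime (small learning rate, modest $N$), each extra epoch still makes real progress, so the estimate tends to track $K$, consistent with the small-$K$ linear-gain result; the plateau behavior would appear only at much larger $K$.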