🤖 AI Summary
This study addresses the absence of a unified framework for quantifying how much pretraining knowledge large language models (LLMs) bring to predicting human behavior. To this end, it introduces the "equivalent sample size," a novel metric that estimates the amount of task-specific data a model trained on that data would need to match the LLM's observed predictive accuracy. The authors develop an asymptotic statistical inference framework that combines flexible machine learning methods, cross-validation, and a comparison of prediction errors. Empirical validation on household income panel data (the Panel Study of Income Dynamics) shows that LLMs encode substantial predictive information for some economic variables but little for others, indicating that their value as a substitute for domain-specific data is highly context-dependent.
📝 Abstract
Large language models (LLMs) are increasingly used to predict human behavior. We propose a measure for evaluating how much knowledge a pretrained LLM brings to such a prediction: its equivalent sample size, defined as the amount of task-specific data needed to match the predictive accuracy of the LLM. We estimate this measure by comparing the prediction error of a fixed LLM in a given domain to that of flexible machine learning models trained on increasing samples of domain-specific data. We further provide a statistical inference procedure by developing a new asymptotic theory for cross-validated prediction error. Finally, we apply this method to the Panel Study of Income Dynamics. We find that LLMs encode considerable predictive information for some economic variables but much less for others, suggesting that their value as substitutes for domain-specific data differs markedly across settings.
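To make the estimation idea concrete, here is a minimal sketch in Python of the learning-curve comparison described above. It is an illustration under simplifying assumptions, not the authors' implementation: the choice of gradient boosting as the flexible model, the hypothetical function names (`learning_curve_errors`, `equivalent_sample_size`), and the linear interpolation of the crossing point are all assumptions, and the paper's asymptotic inference theory for cross-validated prediction error is not reproduced here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score


def learning_curve_errors(X, y, sample_sizes, cv=5, seed=0):
    """Cross-validated MSE of a flexible ML model trained on nested
    subsamples of the domain-specific data, one value per sample size."""
    order = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for n in sample_sizes:
        idx = order[:n]
        model = GradientBoostingRegressor(random_state=seed)  # assumed model choice
        scores = cross_val_score(model, X[idx], y[idx], cv=cv,
                                 scoring="neg_mean_squared_error")
        errors.append(-scores.mean())  # flip sign back to (positive) MSE
    return np.asarray(errors)


def equivalent_sample_size(llm_mse, sample_sizes, curve_mse):
    """Smallest sample size at which the learning curve matches the
    LLM's prediction error, found by linear interpolation. Assumes the
    curve is roughly monotone decreasing in n."""
    sizes = np.asarray(sample_sizes, dtype=float)
    if llm_mse <= curve_mse.min():
        return np.inf  # LLM beats the trained model at every size tried
    if llm_mse >= curve_mse.max():
        return sizes[0]  # LLM is matched before the smallest size tried
    # np.interp needs increasing x-values, so reverse the decreasing curve.
    return float(np.interp(llm_mse, curve_mse[::-1], sizes[::-1]))


# Hypothetical usage: X, y are domain-specific features and outcomes,
# and llm_mse is the fixed LLM's prediction error on held-out data.
# sizes = [100, 200, 400, 800, 1600]
# curve = learning_curve_errors(X, y, sizes)
# print(equivalent_sample_size(llm_mse, sizes, curve))
```

The interpolated crossing point is only a point estimate; the paper's contribution additionally provides confidence statements via a new asymptotic theory for cross-validated prediction error, which a sketch like this omits.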