The Memorization Problem: Can We Trust LLMs' Economic Forecasts?

📅 2025-04-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) exhibit "selective precise memory" for pre-cutoff economic data (e.g., GDP, stock prices, news headlines), so apparent historical "forecasts" may reflect retrospective recall rather than genuine economic reasoning. Method: a cross-model (Llama/GPT/Claude), multi-task consistency evaluation framework combining numerical recall tests, masked-entity reconstruction, and boundary-instruction robustness analysis. Contribution/Results: evidence that LLMs achieve perfect recall of key economic indicators within their training window, behavior that neither instruction-based constraints nor masking interventions prevent. After the knowledge cutoff this recall disappears, indicating that memorization, not reasoning, drives historical-period performance. The paper is the first to systematically identify, characterize, and quantify memory artifacts in LLM-based economic forecasting, with direct implications for trustworthy evaluation of LLM forecasts.
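The core recall-vs-cutoff comparison in the evaluation framework can be sketched as follows. The cutoff date, indicator names, and scoring tolerance are illustrative assumptions, and the model call itself is omitted (model answers are supplied as a plain dict):

```python
from datetime import date

# Hypothetical knowledge cutoff for the model under test (illustrative).
CUTOFF = date(2023, 9, 1)

def recall_prompt(indicator: str, obs_date: date) -> str:
    """Build a direct numerical-recall prompt for one indicator/date pair."""
    return (
        f"What was the exact value of {indicator} on {obs_date.isoformat()}? "
        "Answer with the number only."
    )

def score_recall(truth: dict, answers: dict, tol: float = 1e-6) -> dict:
    """Split exact-recall accuracy into pre- and post-cutoff buckets.

    `truth` and `answers` map (indicator, date) pairs to numeric values;
    a hit is an answer within `tol` of the true value.
    """
    buckets = {"pre_cutoff": [], "post_cutoff": []}
    for (indicator, obs_date), true_value in truth.items():
        guess = answers.get((indicator, obs_date))
        hit = guess is not None and abs(guess - true_value) <= tol
        key = "pre_cutoff" if obs_date < CUTOFF else "post_cutoff"
        buckets[key].append(hit)
    return {k: sum(v) / len(v) if v else float("nan") for k, v in buckets.items()}
```

The pattern the paper reports, perfect pre-cutoff recall and no post-cutoff recall, would surface here as `{'pre_cutoff': 1.0, 'post_cutoff': 0.0}`.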

📝 Abstract
Large language models (LLMs) cannot be trusted for economic forecasts during periods covered by their training data. We provide the first systematic evaluation of LLMs' memorization of economic and financial data, including major economic indicators, news headlines, stock returns, and conference calls. Our findings show that LLMs can perfectly recall the exact numerical values of key economic variables from before their knowledge cutoff dates. This recall is not randomly distributed across dates and data types. This selective perfect memory creates a fundamental issue: when testing forecasting capabilities before their knowledge cutoff dates, we cannot distinguish whether LLMs are forecasting or simply accessing memorized data. Explicit instructions to respect historical data boundaries fail to prevent LLMs from achieving recall-level accuracy in forecasting tasks. Further, LLMs seem exceptional at reconstructing masked entities from minimal contextual clues, suggesting that masking provides inadequate protection against motivated reasoning. Our findings raise concerns about using LLMs to forecast historical data or backtest trading strategies, as their apparent predictive success may merely reflect memorization rather than genuine economic insight. Any application where future knowledge would change LLMs' outputs can be affected by memorization. In contrast, consistent with the absence of data contamination, LLMs cannot recall data after their knowledge cutoff date.
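The boundary-instruction failure described in the abstract can be probed with paired prompts and a simple "recall-level accuracy" flag. The prompt wording, dates, and the 1% threshold below are illustrative assumptions, not the paper's exact protocol:

```python
# Two prompts for the same target: one unconstrained, one with an explicit
# historical-data boundary. Per the abstract, the constraint fails to
# prevent recall-level accuracy. (Wording and dates are illustrative.)
BASE = "Forecast the S&P 500 closing level for 2022-06-30."
CONSTRAINED = (
    "Use only information that was available before 2022-01-01. " + BASE
)

def is_recall_level(pred: float, actual: float, naive_error: float) -> bool:
    """Flag a suspiciously accurate 'forecast': error far below a naive
    baseline (e.g., a random-walk forecast), suggesting memorized recall."""
    return abs(pred - actual) < 0.01 * naive_error
```

If the constrained prompt still yields predictions that trip this flag, the model is likely recalling rather than forecasting.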
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' memorization of economic and financial data
Distinguishing forecasting from memorized data recall in LLMs
Assessing risks of using LLMs for historical data forecasting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-model (Llama/GPT/Claude) evaluation of memorization of economic and financial data
Tests separating genuine forecasting from memorized recall, including boundary-instruction robustness checks
Examines whether entity masking actually protects against recall of memorized data
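The masking check in the last bullet can be sketched as below. The entity lists and verbatim-match criterion are illustrative assumptions, and the reconstruction step (an LLM fill-in-the-mask call) is left out, with model outputs passed in as strings:

```python
import re

def mask_entities(text: str, entities: list, token: str = "[MASK]") -> str:
    """Replace each named entity in `text` with a mask token."""
    for ent in entities:
        text = re.sub(re.escape(ent), token, text)
    return text

def reconstruction_rate(reconstructions, entities_per_doc) -> float:
    """Fraction of masked entities the model reproduced verbatim."""
    hits = total = 0
    for recon, ents in zip(reconstructions, entities_per_doc):
        for ent in ents:
            total += 1
            hits += ent in recon  # bool counts as 0/1
    return hits / total if total else float("nan")
```

A rate near 1.0 on minimally informative contexts would support the abstract's claim that masking offers inadequate protection.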