🤖 AI Summary
This study systematically evaluates the effectiveness and theoretical boundaries of large language models (LLMs) in macroeconomic time series forecasting. Using the FRED-MD dataset, it benchmarks Llama-3, GPT-4, and other LLMs against traditional econometric methods (VAR, DSGE, and ARIMA) under a unified evaluation protocol. Methodologically, it combines fine-tuning, prompt engineering, temporal-embedding adaptation, and ensemble-based calibration to rigorously assess, for the first time, LLMs' generalization capacity in macro forecasting. Results indicate that LLMs achieve competitive or superior accuracy for short-horizon (1- to 3-step) forecasts of high-frequency indicators (e.g., the unemployment rate and consumer confidence index), yet degrade significantly at long horizons and on low-frequency supply-side variables (e.g., the GDP deflator), revealing fundamental limitations in modeling structural breaks and economic mechanisms. The study thus delineates the applicability domains and theoretical constraints of LLMs in macroeconomic prediction.
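The paper's exact protocol is not reproduced here, but a rolling-origin, multi-horizon benchmark of the kind described can be sketched as follows. The FRED-MD URL, the choice of the UNRATE series, the number of evaluation origins, and the ARIMA(2,1,1) order are illustrative assumptions, not the paper's published configuration; an LLM forecaster would be scored by the same loop in place of `arima_forecast`.

```python
# Minimal rolling-origin benchmark on one FRED-MD series (UNRATE),
# scoring an ARIMA baseline by multi-horizon RMSE. URL, series,
# ARIMA order, and window sizes are assumptions for illustration.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

FRED_MD_URL = "https://files.stlouisfed.org/files/htdocs/fred-md/monthly/current.csv"

# The second file row of FRED-MD holds transform codes, not data; skip it.
df = pd.read_csv(FRED_MD_URL, skiprows=[1]).dropna(subset=["UNRATE"])
y = df["UNRATE"].to_numpy(dtype=float)

def arima_forecast(history, horizon, order=(2, 1, 1)):
    """Fit an ARIMA model on the history and forecast `horizon` steps ahead."""
    fit = ARIMA(history, order=order).fit()
    return fit.forecast(steps=horizon)

# Re-fit at each forecast origin over the last n_origins months and
# accumulate squared errors for 1-step and 3-step horizons.
horizons, n_origins = (1, 3), 24
for h in horizons:
    sq_errs = []
    for t in range(len(y) - n_origins - h, len(y) - h):
        pred = arima_forecast(y[:t], h)
        sq_errs.append((pred[-1] - y[t + h - 1]) ** 2)  # error at step t+h
    print(f"ARIMA h={h}: RMSE = {np.sqrt(np.mean(sq_errs)):.3f}")
```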
📝 Abstract
This paper presents a comparative analysis evaluating the accuracy of Large Language Models (LLMs) against traditional macroeconomic time series forecasting approaches. LLMs have recently surged in popularity for forecasting because of their ability to capture intricate patterns in data and to adapt quickly across very different domains. However, how well they forecast macroeconomic time series relative to conventional methods remains an open question. To address this, we conduct a rigorous evaluation of LLMs against traditional macro forecasting methods, using the FRED-MD database as common ground. Our findings provide insights into the strengths and limitations of LLMs in forecasting macroeconomic time series, shedding light on their applicability in real-world scenarios.
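For concreteness, the sketch below shows one common way a macro series can be serialized into a text prompt for a zero-shot LLM forecast and the completion parsed back into numbers. The prompt template, the example unemployment values, and the sample completion are assumptions for illustration (the paper's template is not reproduced), and the actual API call to GPT-4 or Llama-3 is omitted.

```python
# Illustrative serialization of a macro series into an LLM prompt.
# Template and example values are assumptions, not the paper's setup.
def serialize_series(values, decimals=1):
    """Render recent observations as a comma-separated string."""
    return ", ".join(f"{v:.{decimals}f}" for v in values)

def build_prompt(name, values, horizon):
    """Build a plain-text forecasting prompt for an LLM."""
    return (
        f"The following are the last {len(values)} monthly readings of "
        f"the U.S. {name}: {serialize_series(values)}. "
        f"Continue the sequence with the next {horizon} values, "
        f"comma-separated, with no other text."
    )

def parse_forecast(completion, horizon):
    """Extract the first `horizon` numeric forecasts from completion text."""
    nums = [float(tok) for tok in completion.replace("%", "").split(",")]
    return nums[:horizon]

# Hypothetical recent unemployment-rate readings, for illustration only.
unrate_tail = [3.9, 3.8, 3.9, 3.7, 3.7, 3.8, 3.9, 4.0, 4.1, 4.1, 4.2, 4.1]
prompt = build_prompt("unemployment rate", unrate_tail, horizon=3)
print(prompt)
print(parse_forecast("4.1, 4.2, 4.2", horizon=3))  # sample completion
```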