🤖 AI Summary
This work questions the common assumption that time series forecasting models benefit from long context windows mainly because such windows capture long-range dependencies. It offers a complementary explanation: a longer window reduces uncertainty about which data-generating process produced the input series, and this improves predictions. To formalize the idea, the authors frame forecasting groups of time series as two objectives, Generative Process Identification (GPI) and Conditional Forecasting (CF). They prove that, even for processes with memory length $P$, the input window must be strictly longer than $P$ to achieve the minimum attainable error. They also show that decoupling GPI from CF can improve computational scalability without compromising accuracy, and they validate these insights with experiments on synthetic and real-world data.
📝 Abstract
Modern deep learning models for forecasting groups of time series rely on increasingly longer observation windows. However, the benefit of increasing the window size is often simply attributed to capturing long-range dependencies, and broader discussion on how global forecasting models leverage input observations has been limited. In this paper, we show that forecasting groups of time series involves two objectives: (i) generative process identification (GPI), i.e., inferring the specific process generating the input sequence, and (ii) conditional forecasting (CF), i.e., predicting future values given input observations. From this perspective, optimal predictions can be interpreted as an average over plausible data-generating processes, weighted by their likelihood given the input window. This suggests another explanation for the benefits of long context windows: they reduce the uncertainty about which specific process is generating the input time series during operation. We prove that even for processes with memory length $P$, an input window size strictly larger than $P$ is necessary to achieve the minimum attainable error. Finally, we show how decoupling GPI and CF can improve computational scalability without compromising accuracy. Experiments on synthetic and real-world data validate our insights and their relevance for designing forecasting architectures.
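The abstract's reading of optimal predictions as a likelihood-weighted average over plausible processes can be illustrated with a toy simulation. The sketch below is not the authors' method or experimental setup. It assumes a hypothetical group of series in which each series follows one of two stationary AR(1) processes (coefficients -0.8 and 0.8, unit-variance Gaussian noise, uniform prior), so the memory length is $P = 1$. The function names `simulate`, `log_likelihood`, `mixture_forecast`, and `mse_for_window` are illustrative choices, not taken from the paper.

```python
import numpy as np

def simulate(phi, length, rng, sigma=1.0):
    """Draw one stationary AR(1) path: x_t = phi * x_{t-1} + eps_t."""
    x = np.empty(length)
    # Start from the stationary distribution so any window is representative.
    x[0] = rng.normal(0.0, sigma / np.sqrt(1.0 - phi**2))
    for t in range(1, length):
        x[t] = phi * x[t - 1] + rng.normal(0.0, sigma)
    return x

def log_likelihood(window, phi, sigma=1.0):
    """Gaussian log-likelihood of a window under a stationary AR(1) process."""
    var0 = sigma**2 / (1.0 - phi**2)  # stationary variance of the first point
    ll = -0.5 * (np.log(2 * np.pi * var0) + window[0] ** 2 / var0)
    resid = window[1:] - phi * window[:-1]  # empty when the window has length 1
    ll += -0.5 * np.sum(np.log(2 * np.pi * sigma**2) + resid**2 / sigma**2)
    return ll

def mixture_forecast(window, phis, sigma=1.0):
    """One-step forecast: average the per-process forecasts phi * x_last,
    weighted by each process's posterior given the window (uniform prior)."""
    log_w = np.array([log_likelihood(window, p, sigma) for p in phis])
    w = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
    w /= w.sum()
    return float(np.sum(w * phis * window[-1]))

def mse_for_window(window_size, phis, n_series=2000, seed=0):
    """Monte Carlo estimate of the one-step squared error for a window size."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_series):
        phi = rng.choice(phis)  # each series comes from one hidden process
        x = simulate(phi, window_size + 1, rng)
        pred = mixture_forecast(x[:-1], phis)
        errors.append((x[-1] - pred) ** 2)
    return float(np.mean(errors))

# Assumed toy setup: two candidate processes, both with memory length P = 1.
phis = np.array([-0.8, 0.8])
results = {w: mse_for_window(w, phis) for w in (1, 2, 32)}
for w, mse in results.items():
    print(f"window={w:2d}  mse={mse:.3f}")
```

In this toy setup the two processes have the same stationary marginal distribution, so a window of length 1 (equal to $P$) carries no information about which process is active. The two candidate forecasts cancel and the prediction is zero, which leaves an error close to the stationary variance, 1 / (1 - 0.64) ≈ 2.78. Longer windows expose the sign of the lag-1 correlation, and the error should move toward the noise variance of 1. This is consistent with the claim that a window strictly larger than $P$ is needed to reach the minimum attainable error.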