🤖 AI Summary
Non-stationarity often degrades the predictive performance of causal large time-series models, and conventional normalization can leak information from future observations during training. This work evaluates several normalization strategies, including causal normalization and normalization based on statistics from initial observations, for a causal Transformer trained with patching. It compares their impact on training convergence and forecasting accuracy in autoregressive modeling. The study shows that the choice of normalization significantly influences both, which matters in practice for avoiding information leakage in efficient causal time-series training.
📝 Abstract
Large models for time-series forecasting have emerged as a promising paradigm for training on heterogeneous collections of signals. These models typically rely on causal autoregressive architectures, in which each observation is predicted sequentially from past observations. In practice, real-world time series exhibit non-stationarities, which significantly influence predictive performance. To mitigate this, normalization is commonly employed. However, in efficient causal training, where all positions of a sequence are predicted in a single pass, normalization statistics computed over the whole sequence can leak information from future observations into the inputs. Recent alternatives, including causal normalization and statistics computed from initial observations, have been proposed to address this issue, but their practical implications remain insufficiently understood. In this work, we evaluate normalization strategies for transformer-based large time-series models trained with patching and an efficient causal training strategy. We show that the choice of normalization significantly influences both training convergence and forecasting performance.
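To make the leakage issue concrete, the following is a minimal NumPy sketch of the three families of normalization mentioned above. It is an illustration under stated assumptions, not the paper's implementation: the function names (`global_norm`, `causal_norm`, `prefix_norm`), the `n_init` parameter, and the use of a running mean and standard deviation for the causal variant are all hypothetical choices made here for clarity.

```python
import numpy as np


def global_norm(x, eps=1e-8):
    # Whole-series statistics: the value at step t is scaled using
    # observations after t, so future information leaks into the input.
    return (x - x.mean()) / (x.std() + eps)


def causal_norm(x, eps=1e-8):
    # Assumed form of causal normalization: at step t, use the running
    # mean and standard deviation of x[0..t] only.
    t = np.arange(1, len(x) + 1)
    mean = np.cumsum(x) / t
    var = np.cumsum(x ** 2) / t - mean ** 2
    std = np.sqrt(np.maximum(var, 0.0))  # clip small negative round-off
    return (x - mean) / (std + eps)


def prefix_norm(x, n_init, eps=1e-8):
    # Statistics from the first n_init observations (hypothetical
    # parameter), applied unchanged to the rest of the series.
    head = x[:n_init]
    return (x - head.mean()) / (head.std() + eps)


# Leakage check: perturb only the last observation and compare the
# normalized values at all earlier steps.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = x.copy()
y[-1] = 100.0

leaks_global = not np.allclose(global_norm(x)[:-1], global_norm(y)[:-1])
leaks_causal = not np.allclose(causal_norm(x)[:-1], causal_norm(y)[:-1])
leaks_prefix = not np.allclose(prefix_norm(x, 4)[:-1], prefix_norm(y, 4)[:-1])
print(leaks_global, leaks_causal, leaks_prefix)
```

In this check, only the whole-series variant changes its earlier outputs when a later observation is perturbed. The two causal variants differ in what they trade off: running statistics adapt to drift but are noisy at the first few steps, while prefix statistics are fixed and can become stale if the series drifts after the initial window.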