🤖 AI Summary
Financial time series forecasting is difficult because of low signal-to-noise ratios, heavy-tailed distributions, and regime-switching dynamics, and real-world data make it hard to diagnose why a model fails or to assess tail risk. This work proposes FinStressTS, a mechanism-aware synthetic benchmark comprising 30 controlled diagnostic environments built around six mechanism families (volatility clustering, multi-scale persistence, heavy-tailed shocks, regime switching, self-exciting jumps, and zero-inflated processes) that link model performance to interpretable generative mechanisms. An evaluation of 15 classical and deep learning models using NMAE, CRPS, and learning curves shows that model efficacy is mechanism-dependent: autoregressive and linear models are highly competitive and often outperform Transformer-based models in several volatility-, tail-, and jump-driven environments; distributional alignment matters for probabilistic calibration; and neural models often need more data to match simple baselines, with larger gains mainly when learning latent regimes or complex distributions.
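To make the idea of a controlled generative mechanism concrete, the following is a minimal sketch of one textbook process that produces volatility clustering, a GARCH(1,1) model. It is not taken from FinStressTS: the function name `simulate_garch11` and the parameter values `omega`, `alpha`, and `beta` are illustrative assumptions, and the benchmark's actual processes and parameterizations may differ.

```python
import numpy as np


def simulate_garch11(n, omega=0.05, alpha=0.10, beta=0.85, seed=0):
    """Simulate returns r_t = sigma_t * z_t with GARCH(1,1) conditional variance.

    Hypothetical helper with illustrative defaults; not the benchmark's generator.
    """
    if alpha + beta >= 1:
        raise ValueError("alpha + beta must be < 1 for a finite unconditional variance")
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)  # i.i.d. standard normal shocks
    returns = np.empty(n)
    variance = np.empty(n)
    # Start the recursion at the unconditional variance omega / (1 - alpha - beta).
    variance[0] = omega / (1.0 - alpha - beta)
    returns[0] = np.sqrt(variance[0]) * z[0]
    for t in range(1, n):
        # Large past shocks raise today's variance, so volatile periods cluster.
        variance[t] = omega + alpha * returns[t - 1] ** 2 + beta * variance[t - 1]
        returns[t] = np.sqrt(variance[t]) * z[t]
    return returns, variance


returns, variance = simulate_garch11(1000)
```

Because the data-generating parameters are known, a forecaster's error on such a series can be attributed to this one mechanism, which is the kind of attribution a single realized market path does not allow.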
📝 Abstract
Financial forecasting is difficult due to low signal-to-noise ratios, latent factors, heavy tails, regime shifts, and jumps. Real-world benchmarks offer limited failure attribution: researchers can observe underperformance, but often cannot isolate why because mechanisms are unobservable and entangled. Real financial data reveal only one realized path, making it difficult to assess tail-risk calibration or data efficiency. We introduce FinStressTS, a mechanism-aware synthetic benchmark that links model behavior to controlled structural causes. FinStressTS comprises 30 diagnostic environments around six mechanism families: volatility clustering, multi-scale persistence, heavy-tailed shocks, regime switching, self-exciting jumps, and zero-inflated processes. We evaluate two tasks: point forecasting, using NMAE across five settings, and probabilistic forecasting, using CRPS under known data-generating mechanisms. We benchmark 15 models, from classical methods (HAR, VAR) to Transformer forecasters (PatchTST, iTransformer) and deep probabilistic architectures (DeepAR, TSFlow), and use learning curves to measure sample efficiency. Our evaluation reveals three insights. First, performance is mechanism-dependent: autoregressive and linear models are highly competitive, and often outperform Transformer-based models, in several volatility-, tail-, and jump-driven environments. Second, distributional alignment matters: parametric probabilistic models such as DeepAR calibrate well in stationary settings, while flexible models can help when distributions become multimodal or sparse. Third, neural models often require more data to match simple baselines, with larger gains mainly when learning latent regimes or complex distributions. FinStressTS provides an open framework for diagnosing failure modes and advancing risk-aware forecasting.
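The two metrics named in the abstract can be sketched as follows. This is a hedged illustration, not the benchmark's code: the abstract does not state how NMAE is normalized, so dividing the mean absolute error by the mean absolute target is an assumption, and the CRPS function uses the standard sample-based estimator E|X - y| - 0.5 E|X - X'| for a forecast given as draws. The function names `nmae` and `crps_from_samples` are hypothetical.

```python
import numpy as np


def nmae(y_true, y_pred):
    """Mean absolute error divided by the mean absolute target (assumed normalization)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    scale = np.mean(np.abs(y_true))
    if scale == 0:
        raise ValueError("targets are all zero; NMAE is undefined under this normalization")
    return np.mean(np.abs(y_true - y_pred)) / scale


def crps_from_samples(samples, y):
    """Sample-based CRPS estimate for one scalar observation y.

    Uses CRPS ~= mean|X_i - y| - 0.5 * mean|X_i - X_j| over forecast draws X.
    """
    samples = np.asarray(samples, dtype=float)
    # Average distance from the forecast draws to the observation.
    term_obs = np.mean(np.abs(samples - y))
    # Half the average pairwise distance between draws (rewards sharpness).
    term_spread = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term_obs - term_spread


print(nmae([2.0, 2.0], [1.0, 3.0]))          # 0.5
print(crps_from_samples([0.0, 2.0], 1.0))    # 0.5
```

With a single forecast draw the spread term vanishes and the CRPS estimate reduces to the absolute error, which is why CRPS is often read as a probabilistic generalization of MAE. The pairwise term costs O(m²) memory in the number of draws, so this form is only suitable for modest sample sizes.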