🤖 AI Summary
Motivated by the limited efficacy of Transformers in time-series forecasting (TSF), the insufficient robustness of LLM-based approaches, and the dominance of recent observations in TSF data, this paper proposes the first Transformer-gated Mixture-of-Experts (MoE) framework integrating multiple state-of-the-art paradigms. The framework unifies four heterogeneous models—xLSTM, an enhanced linear model, PatchTST, and minGRU—under a learnable Transformer-based gating network that weights the experts dynamically. It further introduces a recency-prioritized temporal weighting scheme to strengthen the modeling of local dynamics. Distinct from existing MoE methods, this work achieves cross-architectural complementarity within a single unified architecture, improving both accuracy and robustness. Extensive experiments demonstrate consistent superiority over leading TSF models—including TimeLLM—across multiple standard benchmarks, empirically validating the effectiveness of heterogeneous model collaboration.
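The gating mechanism described above can be illustrated with a minimal sketch. This is not the paper's implementation: the expert forecasts and gating logits below are placeholder values (in the real framework the logits would come from a learned Transformer encoder over the input window), and only the combination step is shown.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Four hypothetical expert forecasts over a 3-step horizon, standing in
# for the xLSTM, enhanced Linear, PatchTST, and minGRU experts.
expert_forecasts = np.array([
    [0.9, 1.1, 1.0],   # "xLSTM" expert (placeholder values)
    [1.0, 1.0, 1.0],   # "enhanced Linear" expert
    [1.1, 0.9, 1.2],   # "PatchTST" expert
    [0.8, 1.2, 0.9],   # "minGRU" expert
])

# In the proposed framework these logits are produced per input window
# by a Transformer-based gating network; here they are fixed stand-ins.
gate_logits = np.array([0.2, 0.5, 1.0, -0.3])
weights = softmax(gate_logits)          # dynamic per-input expert weights

# Final forecast: a convex combination of the expert outputs.
forecast = weights @ expert_forecasts
```

Because the weights are a softmax over learned logits, the combination is always convex, so the ensemble forecast stays within the span of the individual experts while still adapting the mixture to each input.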
📝 Abstract
The immense success of the Transformer architecture in Natural Language Processing has led to its adoption in Time Series Forecasting (TSF), where superior performance has been reported. However, a recent influential paper questioned the effectiveness of Transformers by demonstrating that a simple single-layer linear model outperforms Transformer-based models. This claim was soon challenged in turn by a stronger Transformer-based model termed PatchTST. More recently, TimeLLM demonstrated even better results by repurposing a Large Language Model (LLM) for the TSF domain. Again, a follow-up paper challenged this by showing that removing the LLM component, or replacing it with a basic attention layer, in fact yields better performance. One of the challenges in forecasting is that TSF data favors the more recent past and is sometimes subject to unpredictable events. Building on these recent insights in TSF, we propose a strong Mixture of Experts (MoE) framework. Our method combines state-of-the-art (SOTA) models including xLSTM, enhanced Linear, PatchTST, and minGRU, among others. This set of complementary and diverse TSF models is integrated through a Transformer-based MoE gating network. Our proposed model outperforms all existing TSF models on standard benchmarks, surpassing even the latest approaches based on MoE frameworks.
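The observation that TSF data favors the more recent past can be made concrete with a small sketch. This is an illustrative assumption, not the paper's actual weighting scheme: it uses a simple geometric decay over the input window, with the `decay` rate and window values chosen arbitrarily.

```python
import numpy as np

def recency_weights(window_len, decay=0.9):
    """Normalized weights that decay geometrically with age:
    the newest observation gets the largest weight."""
    w = decay ** np.arange(window_len - 1, -1, -1)
    return w / w.sum()

# A toy input window; the last value is the most recent observation.
window = np.array([10.0, 12.0, 11.0, 13.0, 14.0])
w = recency_weights(len(window))

# Recency-weighted mean: pulled toward the most recent values.
weighted_mean = float(w @ window)
```

Because recent observations receive larger weights, the weighted statistic tracks the latest level of the series more closely than a plain average, which is the intuition behind prioritizing local dynamics.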