🤖 AI Summary
This paper targets two key challenges in time-series forecasting: insufficient modeling of dynamically varying timestamp relevance (lag effects) and performance degradation caused by noise from anomalous segments. To address them, it proposes the Temporal Mix of Experts (TMOE), an attention-level mechanism for Transformer architectures. TMOE treats each key-value pair as a local expert specialized in a distinct temporal context and performs adaptive expert selection for each query via localized filtering of irrelevant timestamps, while a shared global expert preserves the Transformer's strength in capturing long-range dependencies; together, local filtering and global contextual modeling enable adaptive selection of dynamic temporal contexts. Substituting TMOE for the standard attention modules in PatchTST and Timer (yielding TimeExpert and TimeExpert-G, respectively) delivers consistent gains over state-of-the-art methods on seven long-horizon forecasting benchmarks, reducing average MAE by 3.2%–9.7%. Extensive experiments further demonstrate robustness to anomalies, strong generalization across diverse datasets, and practical effectiveness in real-world forecasting scenarios.
📝 Abstract
Transformer-based architectures dominate time series modeling by enabling global attention over all timestamps, yet their rigid 'one-size-fits-all' context aggregation fails to address two critical challenges in real-world data: (1) inherent lag effects, where the relevance of historical timestamps to a query varies dynamically; (2) anomalous segments, which introduce noisy signals that degrade forecasting accuracy. To resolve these problems, we propose the Temporal Mix of Experts (TMOE), a novel attention-level mechanism that reimagines key-value (K-V) pairs as local experts (each specialized in a distinct temporal context) and performs adaptive expert selection for each query via localized filtering of irrelevant timestamps. Complementing this local adaptation, a shared global expert preserves the Transformer's strength in capturing long-range dependencies. We then replace the vanilla attention mechanism in popular time-series Transformer frameworks (i.e., PatchTST and Timer) with TMOE, without extra structural modifications, yielding our specific version TimeExpert and general version TimeExpert-G. Extensive experiments on seven real-world long-term forecasting benchmarks demonstrate that TimeExpert and TimeExpert-G outperform state-of-the-art methods. Code is available at https://github.com/xwmaxwma/TimeExpert.
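To make the mechanism concrete, below is a minimal NumPy sketch of the idea the abstract describes: each key-value pair acts as a local expert, each query keeps only its top-scoring experts (localized filtering of irrelevant timestamps), and the result is blended with an unfiltered global-attention pass that retains long-range dependencies. All names here (`tmoe_attention`, `k_local`, `alpha`) are illustrative assumptions, and the fixed blending weight stands in for whatever adaptive gating the paper actually learns; see the linked repository for the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tmoe_attention(Q, K, V, k_local=4, alpha=0.5):
    """Sketch of a TMOE-style attention step (single head, no batch).

    Each K-V pair is viewed as a 'local expert'; per query, only the
    k_local highest-scoring experts are kept (localized filtering), and
    the filtered output is blended with a shared 'global expert' that
    attends over all timestamps.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (T_q, T_k): relevance of each expert to each query

    # Local experts: mask everything below each query's k-th largest score,
    # then renormalize over the surviving experts.
    kth = np.sort(scores, axis=-1)[:, -k_local][:, None]
    local_scores = np.where(scores >= kth, scores, -np.inf)
    local_out = softmax(local_scores) @ V

    # Shared global expert: vanilla attention over all timestamps.
    global_out = softmax(scores) @ V

    # alpha is a fixed stand-in for the adaptive combination of local and
    # global contexts described in the paper.
    return alpha * local_out + (1 - alpha) * global_out

rng = np.random.default_rng(0)
T, d = 8, 16
Q, K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=(T, d))
out = tmoe_attention(Q, K, V, k_local=3)
print(out.shape)  # (8, 16)
```

In this reading, anomalous segments are handled implicitly: timestamps inside a corrupted segment score poorly against a given query and are dropped by the top-k filter, while the global pass keeps the overall context available.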