Seg-MoE: Multi-Resolution Segment-wise Mixture-of-Experts for Time Series Forecasting Transformers

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two limitations: existing Transformer-based time series forecasting models struggle to capture long-term dynamics at scale, and conventional Mixture-of-Experts (MoE) architectures rely on token-level routing, which disregards the local continuity inherent in temporal data. To overcome these issues, the authors propose Seg-MoE, the first MoE framework for time series to introduce segment-level routing. By partitioning consecutive time steps into coherent segments and routing each segment as a whole to specialized experts, Seg-MoE lets experts model intra-segment temporal interactions directly, in line with the natural structure of time series. Extensive experiments show that Seg-MoE achieves state-of-the-art performance on multiple multivariate long-term forecasting benchmarks, significantly outperforming both dense Transformers and existing token-level MoE approaches.
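The segment-level routing described above can be illustrated with a minimal sketch. Note this is not the authors' implementation: the segment length, top-1 gating, mean-pooled router input, and two-layer ReLU experts are all assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SegMoELayer:
    """Toy segment-wise MoE layer: routes whole segments, not tokens."""
    def __init__(self, d_model, seg_len, n_experts):
        self.seg_len = seg_len
        # Router scores a mean-pooled summary of each segment.
        self.w_router = rng.normal(0, 0.02, (d_model, n_experts))
        # Each expert is a small two-layer MLP applied to every time
        # step of the segments assigned to it (hypothetical design).
        self.experts = [
            (rng.normal(0, 0.02, (d_model, 2 * d_model)),
             rng.normal(0, 0.02, (2 * d_model, d_model)))
            for _ in range(n_experts)
        ]

    def __call__(self, x):
        # x: [T, d_model]; T is assumed divisible by seg_len.
        T, d = x.shape
        segs = x.reshape(T // self.seg_len, self.seg_len, d)  # [n_seg, S, d]
        summary = segs.mean(axis=1)                           # [n_seg, d]
        gates = softmax(summary @ self.w_router)              # [n_seg, E]
        choice = gates.argmax(axis=-1)                        # top-1 expert per segment
        out = np.empty_like(segs)
        for e, (w1, w2) in enumerate(self.experts):
            mask = choice == e
            if mask.any():
                h = np.maximum(segs[mask] @ w1, 0.0)          # ReLU
                # Weight the expert output by its gate probability.
                out[mask] = (h @ w2) * gates[mask, e][:, None, None]
        return out.reshape(T, d)

layer = SegMoELayer(d_model=8, seg_len=4, n_experts=3)
x = rng.normal(size=(16, 8))
y = layer(x)
print(y.shape)  # (16, 8)
```

The key contrast with token-wise MoE is that the router makes one decision per segment, so all time steps in a segment reach the same expert and that expert sees the segment's local context as a unit.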

📝 Abstract
Transformer-based models have recently made significant advances in accurate time-series forecasting, but even these architectures struggle to scale efficiently while capturing long-term temporal dynamics. Mixture-of-Experts (MoE) layers are a proven solution to scaling problems in natural language processing. However, existing MoE approaches for time-series forecasting rely on token-wise routing mechanisms, which may fail to exploit the natural locality and continuity of temporal data. In this work, we introduce Seg-MoE, a sparse MoE design that routes and processes contiguous time-step segments rather than making independent expert decisions. Token segments allow each expert to model intra-segment interactions directly, naturally aligning with inherent temporal patterns. We integrate Seg-MoE layers into a time-series Transformer and evaluate it on multiple multivariate long-term forecasting benchmarks. Seg-MoE consistently achieves state-of-the-art forecasting accuracy across almost all prediction horizons, outperforming both dense Transformers and prior token-wise MoE models. Comprehensive ablation studies confirm that segment-level routing is the key factor driving these gains. Our results show that aligning the MoE routing granularity with the inherent structure of time series provides a powerful, yet previously underexplored, inductive bias, opening new avenues for conditionally sparse architectures in sequential data modeling.
Problem

Research questions and friction points this paper is trying to address.

time series forecasting
Mixture-of-Experts
temporal locality
segment-wise routing
long-term dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Segment-wise Routing
Mixture-of-Experts
Time Series Forecasting
Transformer
Sparse Architecture