FAME: Forecastability-Aware Mixture of Experts for Heterogeneous Time Series Forecasting

📅 2026-06-07
🤖 AI Summary
A single model rarely forecasts well across heterogeneous time series, while dense ensembles raise inference cost and give little insight into which expert suits which series. This work proposes a sparse mixture-of-experts framework that builds a multidimensional forecastability fingerprint for each series, learns expert suitability from validation performance, and uses a cost-aware router to activate only a small budgeted set of experts per series. By reframing model selection as mining forecastability patterns and expert specialization, the approach makes sparse routing data-driven. On an industrial dataset of more than 5,000 vending machines and more than 60 million transactions, the Top-2 configuration reduces mean squared error by 12.4% relative to the strongest single expert (LightGBM) while invoking only 1.92 experts per series on average.
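The routing step summarized above can be illustrated with a minimal sketch. It is not the paper's router: the function names, the linear cost penalty `cost_weight`, and the `min_weight` pruning rule are all assumptions made for illustration. The summary does not say how an average of 1.92 experts arises under a Top-2 budget; one possibility, assumed here, is that the second expert is dropped when its weight is negligible.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(scores, costs, k=2, cost_weight=0.1, min_weight=0.15):
    """Select at most k experts and return {expert_index: mixing_weight}.

    scores: router scores per expert (higher means more suitable).
    costs:  relative inference cost per expert (hypothetical units).
    The cost penalty and the pruning threshold are illustrative assumptions.
    """
    # Penalize expensive experts before ranking.
    adjusted = [s - cost_weight * c for s, c in zip(scores, costs)]
    ranked = sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)[:k]
    weights = softmax([adjusted[i] for i in ranked])
    # Always keep the best expert; drop others whose weight is negligible,
    # so fewer than k experts may run for an "easy" series.
    kept = [(i, w) for i, w in zip(ranked, weights) if i == ranked[0] or w >= min_weight]
    total = sum(w for _, w in kept)
    return {i: w / total for i, w in kept}

def combine(forecasts, routing):
    """Weighted sum of the forecasts of the activated experts only."""
    return sum(w * forecasts[i] for i, w in routing.items())
```

With scores `[2.0, 1.9, 0.1]` and equal costs, two experts are activated with similar weights; with scores `[5.0, 0.0, -1.0]`, only the first expert runs, which is how the average number of executed experts can fall below the budget in this sketch.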
📝 Abstract
Large-scale retail and industrial forecasting systems contain many heterogeneous time series whose lifecycle, sparsity, volatility, seasonality, spectral patterns, and contextual sensitivity differ substantially. A single forecasting model rarely performs well across all regimes, while dense ensembles increase inference cost and provide limited insight into expert suitability. This paper studies forecastability-aware expert routing: learning how data characteristics determine the suitability of forecasting experts. We propose FAME, a sparse mixture-of-experts framework that represents each series with a multidimensional forecastability fingerprint, mines expert-suitability targets from validation performance, and trains a cost-aware sparse router to activate a small budgeted set of experts for each series. Using a production-scale vending-machine sales dataset from Shandong New Beiyang (SNBC), where the forecasting component has been integrated into the replenishment-planning pipeline, together with public retail benchmarks, we show that expert suitability varies systematically across data regimes. On the industrial dataset with 5,000+ machines and 60M+ transactions, FAME Top-2 reduces MSE by 12.4% over the strongest single expert, LightGBM, while executing 1.92 experts per series on average. The deployed component produces demand forecasts, while inventory-oriented gains are estimated by an offline replay simulator under a fixed replenishment policy rather than by online intervention. The framework turns heterogeneous sales forecasting from heuristic model selection into data mining of forecastability patterns and expert specialization. Code is available at https://github.com/hit636/FAME
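To make the ideas of a forecastability fingerprint and expert-suitability targets concrete, here is a small sketch. It is only an illustration under stated assumptions: the abstract does not list the actual fingerprint features or the target construction, so the four features below (length, zero share, coefficient of variation, seasonal autocorrelation), the weekly `season=7` default, the softmax-style target with a `temperature` parameter, and the expert name `"ets"` are all hypothetical choices.

```python
import math
from statistics import mean, pstdev

def fingerprint(series, season=7):
    """Compute a small, illustrative fingerprint for one sales series.

    Features are assumptions standing in for lifecycle (length),
    sparsity (zero_share), volatility (cv), and seasonality (seasonal_acf).
    """
    n = len(series)
    mu = mean(series)
    zero_share = sum(1 for x in series if x == 0) / n
    cv = pstdev(series) / mu if mu > 0 else 0.0
    # Autocorrelation at the seasonal lag, normalized by total variation.
    var = sum((x - mu) ** 2 for x in series)
    if var > 0 and n > season:
        acf = sum((series[t] - mu) * (series[t - season] - mu)
                  for t in range(season, n)) / var
    else:
        acf = 0.0
    return {"length": n, "zero_share": zero_share, "cv": cv, "seasonal_acf": acf}

def suitability_targets(val_errors, temperature=1.0):
    """Map per-expert validation errors to a soft target distribution.

    Lower validation error gives higher target mass. A router could be
    trained to predict these targets from the fingerprint (assumed setup).
    """
    best = min(val_errors.values())
    raw = {name: math.exp(-(err - best) / temperature)
           for name, err in val_errors.items()}
    total = sum(raw.values())
    return {name: v / total for name, v in raw.items()}

# Example: a sparse series with a strict weekly pattern.
weekly = [0, 0, 5, 0, 0, 5, 0] * 4
fp = fingerprint(weekly)
targets = suitability_targets({"lightgbm": 1.0, "ets": 2.0})
```

In this sketch the fingerprint is computed from history alone, while the targets require a validation window on which every expert has been scored, so the dense evaluation cost is paid offline rather than at inference time.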
Problem

Research questions and friction points this paper is trying to address.

heterogeneous time series
forecastability
expert routing
mixture of experts
forecasting
Innovation

Methods, ideas, or system contributions that make the work stand out.

forecastability-aware routing
mixture of experts
heterogeneous time series forecasting
sparse router
expert suitability