🤖 AI Summary
Standard Transformers’ fully connected attention mechanism neglects the inherent causality and locality of time series, limiting predictive performance. To address this, we propose Weighted Causal Attention (WCA), a novel attention mechanism that introduces a learnable weighting function based on smooth heavy-tailed decay, encoding temporal locality as an end-to-end differentiable inductive bias. WCA combines strict causal masking with principled power-law decay, yielding a Transformer variant that balances architectural flexibility with interpretability. Evaluated on multiple mainstream time-series forecasting benchmarks, the resulting model achieves state-of-the-art accuracy. Moreover, the learned attention weights exhibit clear, monotonic temporal decay, empirically confirming that explicit temporal priors improve both performance and interpretability.
📝 Abstract
Transformers have recently shown strong performance in time-series forecasting, but their all-to-all attention mechanism overlooks the causal and often temporally local nature of the data. We introduce Powerformer, a novel Transformer variant that replaces noncausal attention weights with causal weights that are reweighted according to a smooth heavy-tailed decay. This simple yet effective modification endows the model with an inductive bias favoring temporally local dependencies, while still allowing sufficient flexibility to learn the unique correlation structure of each dataset. Our empirical results demonstrate that Powerformer not only achieves state-of-the-art accuracy on public time-series benchmarks but also offers improved interpretability of attention patterns. Our analyses show that the model's locality bias is amplified during training, demonstrating an interplay between time-series data and power-law-based attention. These findings highlight the importance of domain-specific modifications to the Transformer architecture for time-series forecasting, and they establish Powerformer as a strong, efficient, and principled baseline for future research and real-world applications.
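To make the mechanism concrete, here is a minimal NumPy sketch of one plausible form of weighted causal attention: standard scaled dot-product scores are causally masked, and the resulting attention weights are reweighted by a heavy-tailed power-law kernel `w(d) = (1 + d)^(-alpha)` over the time lag `d`. The kernel's exact form, its placement after the softmax, and the fixed `alpha` are illustrative assumptions; in the paper the decay is a learnable, end-to-end differentiable component.

```python
import numpy as np

def weighted_causal_attention(q, k, v, alpha=1.0):
    """Sketch of a single weighted-causal-attention step.

    q, k, v: arrays of shape (T, d_k) for one head.
    alpha: power-law decay exponent (illustrative fixed value here;
           it would be learnable in an end-to-end model).
    """
    T, dk = q.shape
    scores = q @ k.T / np.sqrt(dk)                        # (T, T) similarities
    lags = np.arange(T)[:, None] - np.arange(T)[None, :]  # time lag d = i - j
    causal = lags >= 0                                    # mask future positions
    decay = np.where(causal, (1.0 + lags) ** (-alpha), 0.0)

    # Causal softmax over the past, then reweight by the power-law kernel
    # and renormalize so each row is again a probability distribution.
    scores = np.where(causal, scores, -np.inf)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = attn * decay
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ v, attn
```

Larger `alpha` concentrates each query's attention on its recent past, which is the locality bias the abstract describes; `alpha -> 0` recovers plain causal attention.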