🤖 AI Summary
Time-series modeling faces challenges including variable-length sequence handling, high feature redundancy, and limited generalization. To address these, we propose ConvFormer, a convolution-like multi-scale fusion framework that jointly employs temporal patching and multi-head attention to progressively compress the time dimension while expanding channel capacity. It further introduces cross-scale attention and logarithmic-space normalization to strengthen multi-scale feature interaction and suppress redundant representations. The resulting hierarchical time-series representation achieves significant improvements over state-of-the-art Transformer- and CNN-based baselines on both forecasting and classification tasks, reducing feature redundancy by 12.6%–28.4% and improving average performance by 3.7%–9.2%. Our core contribution is the first unified integration of convolution-like structural inductive bias, cross-scale attention, and logarithmic normalization within a Transformer architecture, effectively balancing local pattern modeling with global dependency capture.
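The summary's "compress the time dimension while expanding channel capacity" step can be illustrated with a minimal patch-merging sketch. This is not the paper's implementation; the function name `patch_merge` and the patch size are illustrative, and the attention stage between merges is omitted:

```python
# Hypothetical sketch of conv-like temporal patching: a length-T sequence
# of C-dimensional features is grouped into non-overlapping patches of
# size P, shrinking the time axis by P while growing channels by P.
def patch_merge(x, patch_size):
    """x: list of T feature vectors (each a list of C floats).
    Returns T // patch_size vectors of dimension C * patch_size."""
    assert len(x) % patch_size == 0, "sequence length must divide evenly"
    merged = []
    for i in range(0, len(x), patch_size):
        # Concatenate the features of one patch along the channel axis.
        vec = []
        for t in range(patch_size):
            vec.extend(x[i + t])
        merged.append(vec)
    return merged

# Example: T=8 timesteps, C=2 channels, patch size 2.
seq = [[float(t), float(t) + 0.5] for t in range(8)]
out = patch_merge(seq, 2)
print(len(out), len(out[0]))  # time halved to 4, channels doubled to 4
```

Stacking such a merge after each attention block yields the pyramidal, CNN-like hierarchy the summary describes: deeper levels see coarser time resolution but richer channels.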
📝 Abstract
Time series analysis faces significant challenges in handling variable-length data and achieving robust generalization. While Transformer-based models have advanced time series tasks, they often struggle with feature redundancy and limited generalization capabilities. Drawing inspiration from classical CNN architectures' pyramidal structure, we propose a Multi-Scale Representation Learning Framework based on a Conv-like ScaleFusion Transformer. Our approach introduces a temporal convolution-like structure that combines patching operations with multi-head attention, enabling progressive temporal dimension compression and feature channel expansion. We further develop a novel cross-scale attention mechanism for effective feature fusion across different temporal scales, along with a log-space normalization method for variable-length sequences. Extensive experiments demonstrate that our framework achieves superior feature independence, reduced redundancy, and better performance in forecasting and classification tasks compared to state-of-the-art methods.
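The abstract does not give the exact formula for its log-space normalization, so the sketch below shows one plausible interpretation: standardizing features in the log domain so sequences of different lengths and scales become comparable. The function name and the epsilon guard are assumptions, not the paper's method:

```python
import math

def log_space_normalize(x, eps=1e-8):
    """Illustrative log-space normalization (formula assumed, not from
    the paper): map positive values into log space, then standardize
    to zero mean and unit variance regardless of sequence length."""
    logs = [math.log(v + eps) for v in x]
    mean = sum(logs) / len(logs)
    var = sum((l - mean) ** 2 for l in logs) / len(logs)
    std = math.sqrt(var) or 1.0  # guard against constant sequences
    return [(l - mean) / std for l in logs]

# Two sequences of different lengths map to the same normalized scale.
short = log_space_normalize([1.0, 10.0, 100.0])
long_ = log_space_normalize([1.0, 10.0, 100.0, 1000.0, 10000.0])
```

The appeal of working in log space here is that multiplicative scale differences between series become additive shifts, which standardization then removes; this is one way variable-length, variable-scale inputs could be made directly comparable.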