🤖 AI Summary
State-space models (SSMs) suffer from high computational and memory overhead on CPUs/GPUs because continuous-time integration requires numerically solving differential equations, hindering efficient long-sequence modeling. This work proposes EpochCore, a domain-specific systolic-array accelerator for SSMs, featuring LIMA-PE, a novel multi-functional processing element, and ProDF, a unified dataflow architecture, jointly optimized for both conventional DNNs and SSMs. Crucially, EpochCore is the first hardware design to directly accelerate continuous-time integration, enabling end-to-end SSM acceleration. Evaluations show that EpochCore achieves 250× higher throughput and 45× better energy efficiency than a general-purpose systolic array. On the Long Range Arena (LRA) benchmark, it reduces inference latency by ~2,000× compared to GPU execution, substantially overcoming key hardware acceleration bottlenecks for SSMs.
📝 Abstract
Sequence modeling is crucial for AI to understand temporal data and detect complex time-dependent patterns. While recurrent neural networks (RNNs), convolutional neural networks (CNNs), and Transformers have advanced the capture of long-range dependencies, they struggle to achieve high accuracy on very long sequences due to limited memory retention (a fixed context window). State-Space Models (SSMs) leverage exponentially decaying memory to support long context windows, and so they process very long data sequences more efficiently than recurrent and Transformer-based models. Unlike traditional neural models such as CNNs and RNNs, SSM-based models require solving differential equations through continuous integration, making both training and inference compute- and memory-intensive on conventional CPUs and GPUs. In this paper, we introduce EpochCore, a specialized hardware accelerator for SSMs. EpochCore is based on systolic arrays (SAs) and is designed to enhance the energy efficiency and throughput of inference for SSM-based models on long-range sequence tasks. Within the SA, we propose a versatile processing element (PE), called LIMA-PE, that performs both conventional and specialized MAC operations to support traditional DNNs as well as SSMs. To complement the EpochCore microarchitecture, we propose a novel dataflow, ProDF, which enables highly efficient execution of SSM-based models. By leveraging the LIMA-PE microarchitecture and ProDF, EpochCore achieves on average a 250x gain in performance and a 45x improvement in energy efficiency, at the expense of a 2x increase in area cost over traditional SA-based accelerators, and a ~2,000x improvement in latency per inference on LRA datasets compared to GPU kernel operations.
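To make the "continuous integration" workload concrete, the following is a minimal, illustrative sketch (not the paper's actual model or discretization) of a linear state-space model x'(t) = A x(t) + B u(t), y(t) = C x(t) advanced with a simple forward-Euler step. The matrices, step size, and input sequence are hypothetical; the point is that every output token costs a chain of matrix-vector MAC operations for the state update, which is the kind of work a systolic array like EpochCore is built to stream.

```python
import numpy as np

def ssm_step_euler(A, B, C, x, u, dt):
    """One forward-Euler integration step of a continuous-time SSM.

    State update is numerical integration (matrix-vector MACs);
    the output is a projection of the new state.
    """
    x_next = x + dt * (A @ x + B * u)  # x'(t) = A x + B u, integrated
    y = float(C @ x_next)              # y(t) = C x
    return x_next, y

# Toy, illustrative parameters (not from the paper).
rng = np.random.default_rng(0)
N = 4                    # state dimension
A = -np.eye(N)           # stable dynamics -> exponentially decaying memory
B = rng.standard_normal(N)
C = rng.standard_normal(N)

x = np.zeros(N)
ys = []
for u in [1.0, 0.5, -0.25, 0.0]:       # short input sequence
    x, y = ssm_step_euler(A, B, C, x, u, dt=0.1)
    ys.append(y)
# One output per input step; the per-step cost is dominated by A @ x.
```

On a CPU/GPU, this recurrence is sequential and memory-bound per step, which is the bottleneck the LIMA-PE and ProDF designs target.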