🤖 AI Summary
This work addresses the challenge of realizing Strassen’s theoretical speedup for matrix multiplication on hardware, particularly FPGAs. We propose a hierarchical systolic array architecture tailored for FPGA implementation. Methodologically, we introduce the first customized multi-systolic array supporting recursive Strassen expansion and establish, for the first time, a quantitative relationship between recursion depth $r$ and DSP resource savings—achieving up to $1.14^r$-fold DSP reduction—while enabling systematic mapping and holistic optimization from recursion level to hardware resources. Contributions include: (1) high computational utilization for 32×32 and 24×24 matrices; (2) significant DSP reduction with negligible increase in soft logic overhead; and (3) end-to-end integration into an ML accelerator achieving state-of-the-art (SOTA) performance. This is the first hardware-level closed-loop validation demonstrating Strassen’s resource efficiency advantage.
📝 Abstract
While Strassen's matrix multiplication algorithm reduces the complexity of naive matrix multiplication, general-purpose hardware is not suitable for achieving the algorithm's promised theoretical speedups. This leaves open the question of whether the algorithm could be better exploited in custom hardware architectures designed specifically for executing it. However, there is limited prior work on this, and it is not immediately clear how to derive such architectures or whether they can ultimately lead to real improvements. We bridge this gap, presenting and evaluating new systolic array architectures that efficiently translate the theoretical complexity reductions of Strassen's algorithm directly into hardware resource savings. Furthermore, the architectures are multi-systolic array designs that can multiply smaller matrices with higher utilization than single-systolic array designs. The proposed designs implemented on FPGA reduce DSP requirements by a factor of $1.14^r$ for $r$ implemented Strassen recursion levels, while otherwise requiring similar soft logic resources when instantiated to support matrix sizes down to 32×32 and 24×24 at 1–2 levels of Strassen recursion, respectively. We evaluate the proposed designs both in isolation and in an end-to-end machine learning accelerator against baseline designs and prior works, achieving state-of-the-art performance.
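The $1.14^r$ factor follows from Strassen's core trick: each recursion level computes a block matrix product with 7 multiplications instead of the naive 8, so multiplier (DSP) demand shrinks by roughly $(8/7)^r \approx 1.14^r$. The sketch below is purely illustrative and not the paper's hardware design; it shows one level of the recursion on scalar 2×2 operands, with the seven Strassen products spelled out.

```python
# Illustrative sketch (not the paper's architecture): one level of
# Strassen's recursion uses 7 products instead of 8, which is where
# the per-level ~1.14x (= 8/7) DSP reduction comes from.

def strassen_2x2(A, B):
    """Multiply 2x2 matrices A and B using Strassen's 7 products."""
    (a, b), (c, d) = A  # A = [[a, b], [c, d]]
    (e, f), (g, h) = B  # B = [[e, f], [g, h]]
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    # Recombine the 7 products into the 4 output blocks.
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4,           m1 - m2 + m3 + m6]]

def dsp_reduction(r):
    """Multiplier-count reduction after r recursion levels: (8/7)**r."""
    return (8 / 7) ** r

print(strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
for r in (1, 2, 3):
    print(f"r={r}: reduction factor ~ {dsp_reduction(r):.2f}")
```

In a hardware mapping, the extra additions/subtractions above land in soft logic rather than DSPs, which is consistent with the abstract's claim of DSP savings at similar soft logic cost; the recursive application to larger block matrices follows the same pattern.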