SISA: A Scale-In Systolic Array for GEMM Acceleration

📅 2026-03-31
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the inefficiency of conventional square systolic arrays in handling the input-dependent and highly skewed matrices prevalent in large language models, which leads to poor hardware utilization. To overcome this limitation, the authors propose SISA, a novel architecture featuring the first scalable horizontally striped systolic array. Without increasing the number of processing units, SISA employs fine-grained partitioning and an independent scheduling mechanism to efficiently support both small or skewed matrix operations and large-scale GEMM workloads. Evaluated on representative large language model tasks, SISA achieves up to 8.52× speedup and reduces the energy-delay product by 93% compared to state-of-the-art monolithic systolic arrays of equivalent scale.
๐Ÿ“ Abstract
The currently dominant AI/ML workloads, such as Large Language Models (LLMs), rely on the efficient execution of General Matrix-Matrix Multiplication (GEMM) operations. Thus, most systems are equipped with dedicated matrix hardware accelerators based on square Systolic Arrays (SAs) of Processing Elements (PEs). While this organization was effective for traditional Deep Neural Networks (DNNs), LLMs introduce input-dependent and highly skewed matrices, leading to underutilized SA resources. To address this challenge, we propose SISA (Scale-In Systolic Array), a novel SA architecture that partitions the traditional square array into horizontal rectangular slabs. With minimal overhead, SISA exposes parallelism through independently scheduled slabs for efficient execution of small or skewed matrix shapes, while retaining full-array operation for large GEMMs. SISA achieves up to 8.52x speedup and 93% energy-delay-product (EDP) reduction for representative LLMs compared to a state-of-the-art monolithic SA with the same number of PEs.
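The underutilization the abstract describes can be made concrete with a simple tiling model (illustrative only, not from the paper; the array and matrix dimensions below are assumed): an R×C systolic array mapping an M×N output must round M up to a multiple of R and N up to a multiple of C, so a skewed shape with M ≪ R leaves most PEs idle, while narrower independently scheduled slabs with the same total PE count can fit the shape exactly.

```python
import math

def utilization(M, N, rows, cols):
    """Fraction of mapped PE slots doing useful work when an M x N
    output is tiled onto a rows x cols systolic array."""
    tiles = math.ceil(M / rows) * math.ceil(N / cols)
    return (M * N) / (tiles * rows * cols)

# Monolithic 128x128 array on a skewed 8 x 4096 output
# (e.g. an LLM decode step with a small batch dimension):
mono = utilization(8, 4096, 128, 128)   # 8/128 = 0.0625

# Sixteen independently scheduled 8x128 horizontal slabs
# (same total PE count) each fit an 8-row tile exactly:
slab = utilization(8, 4096, 8, 128)     # 1.0
```

Under this toy model the slab organization is 16× better utilized on the skewed shape, while a large square GEMM (M, N both multiples of 128) utilizes either organization fully; the paper's measured 8.52× speedup reflects the real scheduling and data-movement costs this sketch ignores.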
Problem

Research questions and friction points this paper is trying to address.

GEMM
Systolic Array
LLMs
Matrix Skew
Hardware Underutilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systolic Array
GEMM Acceleration
Scale-In Architecture
Matrix Shape Skew
Energy-Delay Product
Luigi Altamura
Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden
Alessio Cicero
Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden
Mateo Vรกzquez Maceiras
Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden
Mohammad Ali Maleki
Postdoc researcher, Chalmers University of Technology
Hardware Acceleration, Software-Hardware Co-Design, Energy-efficient Design
Pedro Trancoso
Department of Computer Science and Engineering, Chalmers University of Technology
Computer Architecture, Many-core Processors, In-Memory Computing, Approximate Computing