TrIM, Triangular Input Movement Systolic Array for Convolutional Neural Networks: Dataflow and Analytical Modelling

📅 2024-08-02

🏛️ IEEE Transactions on Circuits and Systems for Artificial Intelligence

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

To address the von Neumann bottleneck in CNN acceleration, which shows up as high data-movement energy and data redundancy, this paper proposes the Triangular Input Movement (TrIM) dataflow for systolic arrays. TrIM maximizes local input reuse and minimizes weight data movement while avoiding the significant on-chip memory penalty of the row-stationary dataflow. The paper reports that TrIM requires approximately 10× fewer memory accesses than state-of-the-art systolic-array dataflows, achieves up to 81.8% higher throughput than row stationary, and needs up to 15.6× fewer registers than row stationary.

📝 Abstract

In order to follow the ever-growing computational complexity and data intensity of state-of-the-art AI models, new computing paradigms are being proposed. These paradigms aim at achieving high energy efficiency by mitigating the Von Neumann bottleneck, which relates to the energy cost of moving data between the processing cores and the memory. Convolutional Neural Networks (CNNs) are susceptible to this bottleneck, given the massive data they have to manage. Systolic Arrays (SAs) are promising architectures to mitigate the data transmission cost, thanks to the high data utilization of their Processing Elements (PEs). These PEs continuously exchange and process data locally based on specific dataflows (like weight stationary and row stationary), in turn reducing the number of accesses to the main memory. In SAs, convolutions are managed either as matrix multiplications or by exploiting the raster-order scan of sliding windows. However, data redundancy is a primary concern affecting area, power and energy. In this paper, we propose TrIM: a novel dataflow for SAs based on a Triangular Input Movement and compatible with CNN computing. TrIM maximizes the local input utilization, minimizes the weight data movement and solves the data redundancy problem. Furthermore, TrIM does not incur the significant on-chip memory penalty introduced by the row stationary dataflow. When compared to state-of-the-art SA dataflows, the high data utilization offered by TrIM guarantees ~10x fewer memory accesses. Furthermore, considering that PEs continuously overlap multiplications and accumulations, TrIM achieves high throughput (up to 81.8% higher than row stationary) while requiring a limited number of registers (up to 15.6x fewer registers than row stationary).
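The data redundancy the abstract mentions can be made concrete for the matrix-multiplication approach. The sketch below is not the paper's analytical model; it only counts, for one input channel with no padding, how many entries an im2col-style unrolled matrix holds compared with the number of elements in the feature map. The function name `im2col_redundancy` and its parameters are hypothetical, chosen for illustration.

```python
def im2col_redundancy(h, w, k, stride=1):
    """Count input duplication when a k x k convolution over an h x w
    single-channel feature map (no padding) is unrolled im2col-style.

    Returns (feature_map_elements, unrolled_entries, ratio).
    Illustrative only: this is an assumption-level sketch, not TrIM's model.
    """
    # Number of sliding-window positions along each axis.
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1

    # Elements stored once in the original feature map.
    feature_map_elements = h * w

    # Each window position contributes a full k*k column to the unrolled
    # matrix, so overlapping windows copy the same inputs many times.
    unrolled_entries = out_h * out_w * k * k

    return feature_map_elements, unrolled_entries, unrolled_entries / feature_map_elements


if __name__ == "__main__":
    elements, entries, ratio = im2col_redundancy(224, 224, 3)
    print(f"feature map: {elements}, unrolled: {entries}, ratio: {ratio:.2f}")
```

For a 3×3 kernel with stride 1, the ratio approaches 9 as the feature map grows, since almost every input appears in nine overlapping windows. A dataflow that reuses inputs locally inside the array aims to avoid fetching or storing those copies.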

Problem

Research questions and friction points this paper is trying to address.

Mitigates Von Neumann bottleneck in CNNs with systolic arrays

Reduces data redundancy in convolutional neural networks

Improves energy efficiency and throughput in AI models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Triangular Input Movement dataflow for CNNs

Maximizes local input utilization

Minimizes weight data movement

🔎 Similar Papers

TrIM, Triangular Input Movement Systolic Array for Convolutional Neural Networks: Architecture and Hardware Implementation