Optimizing and Exploring System Performance in Compact Processing-in-Memory-based Chips

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address performance degradation in area-constrained processing-in-memory (PIM) chips, where on-chip memory is too small to hold all of a neural network's weights, this work proposes a weight-reuse pipelined architecture coupled with a bubble-aware dynamic scheduling algorithm, enabling hardware-algorithm co-optimization. It is the first work to systematically quantify the trade-off between network scale and system performance under tight area budgets. The approach achieves 2.35× higher throughput and 0.5% better energy efficiency than the baseline; delivers 56.5% of the throughput and 58.6% of the energy efficiency of an area-unlimited design while occupying only one-third of the area; outperforms a modern GPU by 4.56× in throughput and 157× in energy efficiency; and keeps data-movement energy below 20% of total system energy as batch size scales up.

📝 Abstract
Processing-in-memory (PIM) is a promising computing paradigm to tackle the "memory wall" challenge. However, PIM system-level benefits over traditional von Neumann architecture can be reduced when the memory array cannot fully store all the neural network (NN) weights. The NN size is increasing while the PIM design size cannot scale up accordingly due to area constraints. Therefore, this work targets the system performance optimization and exploration for compact PIM designs. We first analyze the impact of data movement on compact designs. Then, we propose a novel pipeline method that maximizes the reuse of NN weights to improve the throughput and energy efficiency of inference in compact chips. To further boost throughput, we introduce a scheduling algorithm to mitigate the pipeline bubble problem. Moreover, we investigate the trade-off between the network size and system performance for a compact PIM chip. Experimental results show that the proposed algorithm achieves 2.35x and 0.5% improvement in throughput and energy efficiency, respectively. Compared to the area-unlimited design, our compact chip achieves approximately 56.5% of the throughput and 58.6% of the energy efficiency while using only one-third of the chip area, along with 1.3x improvement in area efficiency. Our compact design also outperforms the modern GPU with 4.56x higher throughput and 157x better energy efficiency. Besides, our compact design uses less than 20% of the system energy for data movement as batch size scales up.
Problem

Research questions and friction points this paper is trying to address.

Optimizing system performance in compact PIM chips
Addressing memory wall challenge with PIM paradigm
Improving throughput and energy efficiency in NN inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pipeline method maximizes NN weight reuse.
Scheduling algorithm mitigates pipeline bubble issues.
Compact PIM design enhances throughput and energy efficiency.
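The weight-reuse idea above can be illustrated with a toy model. This is not the paper's algorithm; it is a minimal sketch assuming hypothetical cycle costs (`LOAD`, `COMPUTE`) and an LRU pool of resident layer weights, showing how ordering work items to reuse already-loaded weights removes reload stalls ("bubbles") on a capacity-limited chip:

```python
# Toy sketch (not the paper's method): count cycles lost to weight reloads
# when a compact PIM chip cannot keep every layer's weights resident.
LOAD, COMPUTE = 10, 2          # hypothetical cycle costs per reload / per compute
CAPACITY = 2                   # on-chip arrays hold weights for 2 layers at once

def run(schedule, capacity=CAPACITY):
    """Total cycles for a sequence of (sample, layer) work items,
    with an LRU set of layers whose weights are resident on-chip."""
    resident, cycles = [], 0
    for _, layer in schedule:
        if layer in resident:
            resident.remove(layer)       # hit: refresh LRU position, no bubble
        else:
            cycles += LOAD               # miss: pipeline bubble to reload weights
            if len(resident) >= capacity:
                resident.pop(0)          # evict least-recently-used layer
        resident.append(layer)
        cycles += COMPUTE
    return cycles

samples, layers = range(4), range(3)
# Naive order: finish each sample before the next -> every access reloads.
naive = [(s, l) for s in samples for l in layers]
# Weight-reuse order: push all samples through a layer before moving on.
reuse = [(s, l) for l in layers for s in samples]

print(run(naive), run(reuse))   # -> 144 54
```

With these assumed costs, the sample-first order misses on all 12 work items (144 cycles), while the layer-first order loads each layer's weights once (54 cycles), mirroring why batching inferences through resident weights raises throughput in a compact chip.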