🤖 AI Summary
To address the high latency and low energy efficiency that Processing-in-Memory (PIM) architectures suffer due to the intrinsic exclusivity between computation and memory access, this paper proposes Shared-PIM, a DRAM-based PIM architecture that enables truly concurrent computation and data movement. Its core innovation is an intra-bank row-level resource sharing mechanism that permits computation and data transfer to proceed in parallel within the same DRAM bank. Combined with enhanced memory-peripheral coordination, row-buffer scheduling, dual-path data routing, and lightweight control logic, Shared-PIM achieves fine-grained overlap of compute and memory access. Compared to LISA, Shared-PIM reduces data movement latency by 5× and energy consumption by 1.2×, at an area overhead of only 7.16%. Experimental evaluation shows performance improvements of 40%, 44%, 31%, and 29% for matrix multiplication, polynomial multiplication, number-theoretic transform (NTT), and graph traversal (BFS/DFS), respectively, significantly improving both energy efficiency and end-to-end latency.
📝 Abstract
Processing-in-Memory (PIM) enhances memory with computational capabilities, potentially solving the energy and latency issues associated with data transfer between memory and processors. However, managing concurrent computation and data flow within a PIM architecture incurs significant latency and energy penalties for applications. This paper introduces Shared-PIM, an architecture for in-DRAM PIM that strategically allocates rows in memory banks, bolstered by memory peripherals, for concurrent processing and data movement. Shared-PIM enables simultaneous computation and data transfer within a memory bank. Compared to LISA, a state-of-the-art architecture that facilitates data transfers for in-DRAM PIM, Shared-PIM reduces data movement latency and energy by 5x and 1.2x, respectively. Furthermore, when integrated into a state-of-the-art (SOTA) in-DRAM PIM architecture (pLUTo), Shared-PIM achieves 1.4x faster addition and multiplication, and thereby improves the performance of matrix multiplication (MM) tasks by 40%, polynomial multiplication (PMM) by 44%, and number-theoretic transform (NTT) tasks by 31%. Moreover, for graph processing tasks like Breadth-First Search (BFS) and Depth-First Search (DFS), Shared-PIM achieves a 29% improvement in speed, all with an area overhead of just 7.16% compared to the baseline pLUTo.