🤖 AI Summary
This paper targets three limitations of existing Processing-using-DRAM (PUD) approaches: a static data representation with fixed bit-precision, throughput-oriented execution that hides per-operation latency only under bulk data-level parallelism, and high latency for high-precision (e.g., 32-bit) operations. It proposes Proteus, a dynamic-precision bit-serial computing architecture that executes the independent primitives of a PUD operation in parallel by exploiting DRAM's internal parallelism, and that reduces bit-precision at run time by exploiting narrow values (values with many leading zeros). The hardware design combines DRAM analog characteristics, bit-serial arithmetic, dynamic precision pruning, bank-level parallel scheduling, and dedicated narrow-value detection units. Using a single DRAM bank and averaged across twelve real-world applications, Proteus consumes 90.3×, 21×, and 8.1× less energy than the CPU, GPU, and SIMDRAM baselines, respectively, and delivers 17×, 7.3×, and 10.2× their performance per mm². Its area overhead is 1.6% of a DRAM chip and 0.03% of a CPU die.
📝 Abstract
Processing-using-DRAM (PUD) is a paradigm where the analog operational properties of DRAM structures are used to perform bulk logic operations. While PUD promises high throughput at low energy and area cost, we uncover three limitations of existing PUD approaches that lead to significant inefficiencies: (i) static data representation, i.e., two's complement with fixed bit-precision, leading to unnecessary computation over useless (i.e., inconsequential) data; (ii) support for only throughput-oriented execution, where the high latency of individual PUD operations can only be hidden in the presence of bulk data-level parallelism; and (iii) high latency for high-precision (e.g., 32-bit) operations. To address these issues, we propose Proteus, which builds on two key ideas. First, Proteus parallelizes the execution of independent primitives in a PUD operation by leveraging DRAM's internal parallelism. Second, Proteus reduces the bit-precision for PUD operations by leveraging narrow values (i.e., values with many leading zeros). We compare Proteus to different state-of-the-art computing platforms (CPU, GPU, and the SIMDRAM PUD architecture) for twelve real-world applications. Using a single DRAM bank, Proteus provides (i) 17×, 7.3×, and 10.2× the performance per mm²; and (ii) 90.3×, 21×, and 8.1× lower energy consumption than that of the CPU, GPU, and SIMDRAM, respectively, on average across these applications. Proteus incurs low area cost on top of a DRAM chip (1.6%) and CPU die (0.03%).
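To make the narrow-value idea concrete, the following is a minimal Python sketch of how a reduced bit-precision could be derived from the data itself. It is illustrative only and is not Proteus's detection mechanism: the function names `required_bits` and `bit_serial_steps` are hypothetical, the handling of negative values (counting redundant sign bits) is an assumption that extends the abstract's "leading zeros" description to two's complement, and the cost model assumes that a bit-serial operation takes a number of steps proportional to the operand width.

```python
def required_bits(values):
    """Smallest two's-complement width that can represent every value.

    Assumption: a value is "narrow" when its upper bits are redundant,
    i.e. leading zeros for non-negative values or leading ones for
    negative values.
    """
    widths = []
    for v in values:
        # ~v maps a negative value to the non-negative value with the
        # same number of significant bits (e.g. -128 -> 127).
        magnitude = v if v >= 0 else ~v
        widths.append(magnitude.bit_length() + 1)  # +1 for the sign bit
    return max(widths, default=1)


def bit_serial_steps(values, declared_bits=32):
    """Return (steps at reduced precision, steps at declared precision).

    Assumed cost model: one step per bit position, as in a simple
    bit-serial adder. Real PUD operations have operation-specific costs.
    """
    needed = min(required_bits(values), declared_bits)
    return needed, declared_bits


if __name__ == "__main__":
    data = [3, 17, 100, -5]  # declared as 32-bit integers
    reduced, full = bit_serial_steps(data)
    print(f"bit positions processed: {reduced} instead of {full}")
```

Under this simplified model, an array declared as 32-bit but holding only small values is processed over a fraction of the bit positions, which is the inefficiency the abstract attributes to a static data representation.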