Hardware-Software Co-Design for Accelerating Transformer Inference Leveraging Compute-in-Memory

📅 2025-02-17
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
The quadratic computational complexity, O(n²), of the attention mechanism in Transformer inference imposes severe energy overhead and on-chip memory bottlenecks. Method: This paper proposes a hardware–software co-designed compute-in-memory (CIM) accelerator. It introduces unified compute and lookup modules (UCLMs) that integrate exponential lookup tables and multiply-accumulate (MAC) operations within a single SRAM array, coupled with fine-grained pipelined scheduling and cross-core reduce-and-gather parallelism, reducing the quadratic dependence of on-chip memory on sequence length to a linear one. The design, in TSMC 65 nm, supports INT-8 precision and is evaluated with a compiler tailored to attention computation and a cycle-level CIM simulator. Contribution/Results: Experiments show that, versus an NVIDIA A40 GPU, the accelerator achieves 4.4–9.8× higher end-to-end throughput and 16–36× better energy efficiency; against a baseline CIM architecture, it delivers 1.7–5.9× higher throughput while maintaining comparable energy efficiency.
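To make the lookup half of the UCLM concrete, the sketch below shows how a small in-memory exponential table can stand in for exp() when computing softmax numerators after max-subtraction. The table size (256 entries), input range, and bin width are illustrative assumptions, not parameters reported by the paper.

```python
import numpy as np

# Hypothetical parameters: a 256-entry table covering inputs in [-8, 0),
# the range scores fall into after subtracting the row maximum. Size,
# range, and bin width are assumptions, not values from the paper.
TABLE_SIZE = 256
X_MIN, X_MAX = -8.0, 0.0
STEP = (X_MAX - X_MIN) / TABLE_SIZE

# Precompute exp() at the center of each quantization bin, as a lookup
# table stored in an SRAM array would be initialized once.
EXP_LUT = np.exp(X_MIN + (np.arange(TABLE_SIZE) + 0.5) * STEP)

def lut_exp(x: np.ndarray) -> np.ndarray:
    """Approximate exp(x) by table lookup; inputs below X_MIN clamp to
    the smallest bin, whose value is already close to zero."""
    idx = np.clip(((x - X_MIN) / STEP).astype(int), 0, TABLE_SIZE - 1)
    return EXP_LUT[idx]

# Softmax numerators: subtracting the row maximum keeps inputs <= 0,
# so the bounded table suffices.
scores = np.array([2.0, 1.0, -0.5, 3.0])
num = lut_exp(scores - scores.max())
softmax = num / num.sum()
```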

📝 Abstract
Transformers have become the backbone of neural network architectures for most machine learning applications. Their widespread use has resulted in multiple efforts to accelerate attention, the basic building block of transformers. This paper tackles the challenges associated with accelerating attention through a hardware-software co-design approach while leveraging a compute-in-memory (CIM) architecture. In particular, our energy- and area-efficient CIM-based accelerator, named HASTILY, aims to accelerate softmax computation, an integral operation in attention, and to minimize its high on-chip memory requirements, which grow quadratically with input sequence length. Our architecture consists of novel CIM units called unified compute and lookup modules (UCLMs) that integrate both lookup and multiply-accumulate functionality within the same SRAM array, incurring minimal area overhead over standard CIM arrays. Designed in TSMC 65 nm, UCLMs can concurrently perform exponential and matrix-vector multiplication operations. Complementing the proposed architecture, HASTILY features a fine-grained pipelining strategy for scheduling both attention and feed-forward layers, reducing the quadratic dependence on sequence length to a linear dependence. Further, fast softmax computation, which involves computing the maximum and the sum of exponential values, is parallelized across multiple cores using a reduce-and-gather strategy. We evaluate the proposed architecture using a compiler tailored towards attention computation and a standard cycle-level CIM simulator. Our evaluation shows end-to-end throughput (TOPS) improvements of 4.4x-9.8x and 1.7x-5.9x over an NVIDIA A40 GPU and baseline CIM hardware, respectively, for BERT models with INT-8 precision. Additionally, it shows gains of 16x-36x in energy efficiency (TOPS/W) over the A40 GPU and similar energy efficiency to the baseline CIM hardware.
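The reduce-and-gather softmax described above can be pictured as follows: each core computes a local maximum and a partial sum of exponentials over its slice of an attention row, the two statistics are reduced globally, and the result is gathered back so each core normalizes its slice independently. The core count and even slicing below are assumptions made for illustration, not the paper's mapping.

```python
import numpy as np

def reduce_gather_softmax(row: np.ndarray, n_cores: int = 4) -> np.ndarray:
    """Numerically stable softmax of one attention row, organized the way
    a cross-core reduce-and-gather scheme might be: each core owns a
    slice, local statistics are reduced globally, and the normalizer is
    gathered back so every core finishes independently."""
    slices = np.array_split(row, n_cores)

    # Reduce, step 1: local maxima -> global maximum.
    global_max = max(s.max() for s in slices)

    # Reduce, step 2: per-core sums of shifted exponentials -> denominator.
    denom = sum(np.exp(s - global_max).sum() for s in slices)

    # Gather: broadcast (global_max, denom); each core normalizes its slice.
    return np.concatenate([np.exp(s - global_max) / denom for s in slices])

# Self-check against a single-core reference softmax.
row = np.random.randn(1024)
ref = np.exp(row - row.max()); ref /= ref.sum()
assert np.allclose(reduce_gather_softmax(row), ref)
```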
Problem

Research questions and friction points this paper is trying to address.

Accelerate Transformer inference through hardware-software co-design.
Minimize on-chip memory requirements for the attention mechanism.
Improve the energy efficiency of softmax computation with a CIM architecture.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compute-in-memory (CIM) based accelerator
Unified compute and lookup modules (UCLMs)
Fine-grained pipelining strategy (see the sketch after this list)
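As intuition for how fine-grained pipelining can shrink the quadratic on-chip buffering of attention to linear, the sketch below streams key/value blocks for one query while carrying running softmax statistics, in the spirit of the well-known online-softmax formulation. It is not the paper's exact schedule (which also pipelines feed-forward layers); it only illustrates the memory trade.

```python
import numpy as np

def streaming_attention(q, K, V, block=64):
    """Attention output for a single query, streaming K/V in fixed-size
    blocks with running softmax statistics (the online-softmax trick).
    Only O(block) scores are ever resident, rather than a full length-n
    score row, which is one way a pipelined schedule can trade a
    quadratic buffer for a linear one. Shown for intuition only."""
    d = q.shape[0]
    run_max, run_sum = -np.inf, 0.0
    acc = np.zeros(d)
    for i in range(0, K.shape[0], block):
        s = K[i:i + block] @ q / np.sqrt(d)   # scores for this block only
        new_max = max(run_max, s.max())
        scale = np.exp(run_max - new_max)     # rescale older statistics
        p = np.exp(s - new_max)
        run_sum = run_sum * scale + p.sum()
        acc = acc * scale + p @ V[i:i + block]
        run_max = new_max
    return acc / run_sum

# Self-check against unblocked attention for one query.
n, d = 512, 64
q, K, V = np.random.randn(d), np.random.randn(n, d), np.random.randn(n, d)
s = K @ q / np.sqrt(d)
w = np.exp(s - s.max()); w /= w.sum()
assert np.allclose(streaming_attention(q, K, V), w @ V)
```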
Dong Eun Kim
Ph.D. Student, Purdue University
Computer Architecture · Machine Learning · AI Accelerator
Tanvi Sharma
Department of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, 47907
Kaushik Roy
Department of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, 47907