LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System

📅 2024-12-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
In long-context large language model (LLM) inference, KV cache access is bottlenecked by memory bandwidth, leading to high latency and low throughput. Method: This paper proposes a scalable multi-node DRAM-PIM architecture featuring hardware-software co-design: (i) a novel pipelined parallelism mechanism across PIM modules; (ii) a context-length-adaptive Direct PIM Access (DPA) controller; and (iii) an MLIR-based custom compiler and hardware simulation framework enabling dynamic memory management and scheduling optimization. Contribution/Results: Experiments demonstrate that the architecture achieves up to 8.54× and 16.0× higher throughput than multi-GPU and GPU-PIM baselines, respectively, while significantly reducing end-to-end latency. It establishes a new hardware-software co-design paradigm for efficient LLM inference deployment.

📝 Abstract
The expansion of large language models (LLMs) to hundreds of billions of parameters places significant demands on computational resources, particularly data movement and memory bandwidth. Long-context LLMs, which process sequences of tens of thousands of tokens, further stress the memory system, as the complexity of the attention layers and the size of the key-value cache grow with the context length. Processing-in-Memory (PIM) maximizes memory bandwidth by moving compute to the data and can address these bandwidth challenges; however, PIM does not necessarily scale to long-context LLM acceleration because of limited per-module memory capacity, the inflexibility of fixed-function-unit PIM architectures, and static memory management. In this work, we propose LoL-PIM, a multi-node PIM architecture that accelerates long-context LLMs through hardware-software co-design. In particular, we show how pipeline parallelism can be exploited across multiple PIM modules, and we propose a direct PIM access (DPA) controller (in effect, a DMA for PIM) that enables dynamic PIM memory management and achieves efficient PIM utilization across a diverse range of context lengths. We developed an MLIR-based compiler for LoL-PIM that extends a commercial PIM-based compiler; the software modifications were implemented and evaluated on this compiler, while the hardware changes were modeled in a simulator. Our evaluations demonstrate that LoL-PIM significantly improves throughput and reduces latency for long-context LLM inference, outperforming both multi-GPU and GPU-PIM systems (up to 8.54x and 16.0x speedup, respectively), thereby enabling more efficient deployment of LLMs in real-world applications.
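The pipeline-parallel idea in the abstract can be illustrated with a toy latency model: when multiple decode requests are in flight, the PIM modules hosting different pipeline stages can work on different requests concurrently instead of idling. The sketch below is illustrative only; the module count, per-stage latency, and function names are assumptions, not figures or APIs from the paper.

```python
# Toy pipeline-makespan model (illustrative; not from the paper).
# Each "stage" stands for a group of transformer layers mapped to one
# PIM module; each "item" stands for one decode step of one request.

def serial_latency(num_modules: int, num_items: int, stage_time: float) -> float:
    """No pipelining: each item traverses all modules before the next starts."""
    return num_modules * num_items * stage_time

def pipelined_latency(num_modules: int, num_items: int, stage_time: float) -> float:
    """Classic pipeline makespan: modules process different items concurrently."""
    return (num_modules + num_items - 1) * stage_time

if __name__ == "__main__":
    modules, items, t = 4, 64, 1.0  # hypothetical: 4 PIM modules, 64 decode steps
    print(serial_latency(modules, items, t))     # 256.0
    print(pipelined_latency(modules, items, t))  # 67.0
```

With enough concurrent work, the pipelined makespan approaches `num_items * stage_time`, i.e., per-module throughput rather than the serial product, which is the motivation for spreading long-context decoding across multiple PIM modules.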
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Computational Resources
Memory Management
Innovation

Methods, ideas, or system contributions that make the work stand out.

LoL-PIM
multi-node in-memory computing
dynamic memory management
Hyucksung Kwon
Hanyang University
Kyungmo Koo
Hanyang University
Janghyeon Kim
Hanyang University
Woongkyu Lee
Soongsil University
material science, electrical engineering, thin film, atomic layer deposition, semiconductor
Minjae Lee
Hanyang University
Hyungdeok Lee
Solution Advanced Technology, SK hynix
Yousub Jung
Solution Advanced Technology, SK hynix
Jaehan Park
Columbia University
Analog and mixed signal circuit, Hardware security
Yosub Song
Solution Advanced Technology, SK hynix
Byeongsu Yang
Solution Advanced Technology, SK hynix
Haerang Choi
Solution Advanced Technology, SK hynix
Guhyun Kim
Solution Advanced Technology, SK hynix
Jongsoon Won
Solution Advanced Technology, SK hynix
Woojae Shin
Solution Advanced Technology, SK hynix
Changhyun Kim
Solution Advanced Technology, SK hynix
Gyeongcheol Shin
Solution Advanced Technology, SK hynix
Yongkee Kwon
Tenstorrent
Computer architecture, Programming models
Ilkon Kim
Solution Advanced Technology, SK hynix
Euicheol Lim
Solution Advanced Technology, SK hynix
John Kim
KAIST
Jungwook Choi
Hanyang University
Deep Neural Network, Quantization, Large Language Model, Efficient AI, AI Accelerator