Sangam: Chiplet-Based DRAM-PIM Accelerator with CXL Integration for LLM Inferencing

📅 2025-11-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address memory-bandwidth bottlenecks and rapid KV-cache growth in large language model (LLM) inference, this work proposes a DRAM-based processing-in-memory (PIM) accelerator built on a chiplet architecture. The design attaches over CXL for either GPU-coordinated or standalone deployment, and decouples logic from memory using heterogeneous process technologies, placing energy-efficient systolic arrays and SRAM buffers near the DRAM banks to escape the conventional trade-off between compute capability and memory capacity in DRAM-PIM. Methodologically, it combines PIM execution, high-bandwidth chiplet interconnects, and GEMM/GEMV optimizations tailored to LLM decoding. Evaluated on LLaMA 2-7B, Mistral-7B, and LLaMA 3-70B, the accelerator achieves up to 4.22× end-to-end latency reduction, up to 10.3× higher decoding throughput, and order-of-magnitude gains in energy efficiency.

📝 Abstract
Large Language Models (LLMs) are becoming increasingly data-intensive due to growing model sizes, and they are becoming memory-bound as the context length and, consequently, the key-value (KV) cache size increase. Inference, particularly the decoding phase, is dominated by memory-bound GEMV or flat GEMM operations with low operational intensity (OI), making it well-suited for processing-in-memory (PIM) approaches. However, existing in/near-memory solutions face critical limitations such as reduced memory capacity due to the high area cost of integrating processing elements (PEs) within DRAM chips, and limited PE capability due to the constraints of DRAM fabrication technology. This work presents a chiplet-based memory module that addresses these limitations by decoupling logic and memory into chiplets fabricated in heterogeneous technology nodes and connected via an interposer. The logic chiplets sustain high-bandwidth access to the DRAM chiplets, which house the memory banks, and enable the integration of advanced processing components such as systolic arrays and SRAM-based buffers to accelerate memory-bound GEMM kernels, capabilities that were not feasible in prior PIM architectures. We propose Sangam, a CXL-attached, PIM-chiplet-based memory module that can either act as a drop-in replacement for GPUs or co-execute alongside them. Sangam achieves 3.93×, 4.22×, and 2.82× speedup in end-to-end query latency, 10.3×, 9.5×, and 6.36× greater decoding throughput, and order-of-magnitude energy savings compared to an H100 GPU across varying input sizes, output lengths, and batch sizes on LLaMA 2-7B, Mistral-7B, and LLaMA 3-70B, respectively.
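The abstract's claim that decoding is dominated by low-OI GEMV can be checked with a back-of-envelope calculation. The sketch below is illustrative only; the hidden dimension and prompt length are assumed values (LLaMA-7B-like), not figures from the paper.

```python
# Back-of-envelope: operational intensity (FLOPs per byte of weight traffic)
# for a weight matrix multiply during decode (batch=1, GEMV) vs. prefill
# (many tokens at once, GEMM). Dimensions are assumed for illustration.

def operational_intensity(m, n, batch, bytes_per_weight=2):
    """FLOPs per byte moved for an (m x n) weight times (n x batch) input,
    counting only weight traffic (which dominates when batch is small)."""
    flops = 2 * m * n * batch               # each multiply-accumulate = 2 FLOPs
    weight_bytes = m * n * bytes_per_weight  # fp16 weights read once
    return flops / weight_bytes

d = 4096  # assumed hidden dimension

decode_oi = operational_intensity(d, d, batch=1)    # one token per step: GEMV
prefill_oi = operational_intensity(d, d, batch=512) # long prompt: GEMM

print(f"decode OI:  {decode_oi:.1f} FLOP/byte")   # -> 1.0
print(f"prefill OI: {prefill_oi:.1f} FLOP/byte")  # -> 512.0
```

At roughly 1 FLOP/byte, decode sits far below the roofline knee of a modern GPU, so bandwidth, not compute, bounds throughput, which is exactly the regime where moving compute next to the DRAM banks pays off.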
Problem

Research questions and friction points this paper is trying to address.

Addressing memory bottlenecks in LLM inference caused by large KV caches
Overcoming capacity and capability limitations of existing PIM architectures
Accelerating memory-bound GEMM operations through chiplet-based DRAM-PIM design
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chiplet-based DRAM-PIM accelerator with CXL integration
Decouples logic and memory chiplets via interposer connection
Uses systolic arrays and SRAM buffers for GEMM acceleration
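To make the bank-parallel execution model concrete, here is a minimal sketch of a GEMV whose weight matrix is partitioned column-wise across DRAM banks, with each bank's processing element producing a partial result that a logic-side reduction combines. This is an assumed illustration of the general DRAM-PIM dataflow, not the paper's actual mapping or systolic-array design.

```python
# Illustrative (assumed) bank-parallel GEMV: W is split column-wise across
# banks; each bank computes a partial y from its slice of W and x, and the
# logic chiplet sums the partials.

from typing import List

def split_columns(matrix: List[List[float]], num_banks: int):
    """Partition W column-wise so each bank holds a contiguous column slice."""
    n = len(matrix[0])
    step = (n + num_banks - 1) // num_banks
    return [[row[b * step:(b + 1) * step] for row in matrix]
            for b in range(num_banks)]

def bank_gemv(w_slice, x_slice):
    """Partial y = W_slice @ x_slice, as a bank-side PE would compute it."""
    return [sum(wi * xi for wi, xi in zip(row, x_slice)) for row in w_slice]

def pim_gemv(W, x, num_banks=4):
    slices = split_columns(W, num_banks)
    step = (len(x) + num_banks - 1) // num_banks
    partials = [bank_gemv(s, x[b * step:(b + 1) * step])
                for b, s in enumerate(slices)]
    # Logic-chiplet reduction of the per-bank partial vectors.
    return [sum(p[i] for p in partials) for i in range(len(W))]

W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
x = [1.0, 1.0, 1.0, 1.0]
print(pim_gemv(W, x, num_banks=2))  # [10.0, 26.0]
```

The point of the partitioning is that each bank streams only its own slice of W at full internal bandwidth; only the small partial vectors cross the interconnect to the logic chiplet for reduction.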
Khyati Kiyawat
University of Virginia
Zhenxing Fan
University of Virginia
Yasas Seneviratne
University of Virginia
Morteza Baradaran
University of Virginia
Akhil Shekar
University of Virginia
Zihan Xia
University of California, San Diego
Mingu Kang
University of California, San Diego
Kevin Skadron
Harry Douglas Forsyth Professor of Computer Science, University of Virginia
computer architecture, processing in memory, hardware acceleration, automata processing, heterogeneous computing