MCBP: A Memory-Compute Efficient LLM Inference Accelerator Leveraging Bit-Slice-enabled Sparsity and Repetitiveness

📅 2025-09-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high inference latency of large language models (LLMs) caused by inefficient GEMM computation, weight memory access, and KV cache access, this paper proposes a bit-sliced compute-memory co-optimization architecture. Departing from conventional value-level optimizations, it jointly optimizes computation, weight compression, and cache access via three mechanisms: bit-sliced vector redundancy elimination, high-order bit-level sparsity encoding, and bit-wise progressive prediction. At the hardware level, it introduces custom accelerators and memory hierarchies that support native bit-level operations. Evaluated on 26 benchmarks, the design achieves a 9.43× speedup and 31.1× higher energy efficiency than an NVIDIA A100 GPU, and reduces energy consumption by 35×, 5.2×, and 3.2× versus Spatten, FACT, and SOFA, respectively. Its core contribution is the first unified modeling of fine-grained bit-sliced redundancy and sparsity, addressing a long-standing compute-memory co-optimization bottleneck in Transformer inference.

📝 Abstract
Large language models (LLMs) face significant inference latency due to inefficiencies in GEMM operations, weight access, and KV cache access, especially in real-time scenarios. This highlights the need for a versatile compute-memory efficient accelerator. Unfortunately, existing Transformer accelerators struggle to address both aspects simultaneously, as they focus on value-level processing, missing fine-grained opportunities to optimize computation and memory collaboratively. This paper introduces MCBP, a bit-grained compute-memory efficient algorithm-hardware co-design that leverages bit-slice (BS) enabled repetitiveness and sparsity to accelerate LLM inference. MCBP features three key innovations: 1) BS-repetitiveness-enabled computation reduction (BRCR), which eliminates redundant GEMM computations by leveraging redundancy hidden among BS vectors; 2) BS-sparsity-enabled two-state coding (BSTC), which reduces weight access by exploiting significant sparsity in high-order bit-slice weights; 3) bit-grained progressive prediction (BGPP), which reduces KV cache access through early-termination-based bit-grained prediction. These techniques, supported by custom accelerator designs, effectively alleviate the burden of GEMM, weight access, and KV cache access. Extensive experiments on 26 benchmarks show that MCBP achieves 9.43× speedup and 31.1× higher energy efficiency than an NVIDIA A100 GPU. Compared to SOTA Transformer accelerators, MCBP achieves 35×, 5.2×, and 3.2× energy savings over Spatten, FACT, and SOFA, respectively.
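The early-termination idea behind BGPP can be sketched as MSB-first score refinement: partial query-key scores are accumulated from high-order bit-slices downward, and keys whose score upper bound can no longer reach the running top-k are dropped before their low-order bits (and the rest of their KV cache entries) are read. The function name, top-k pruning criterion, and data types below are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def progressive_scores(q, K, bits=8, keep=2):
    """Hedged sketch of MSB-first top-k pruning (not the paper's BGPP).

    q: (d,) uint8 query; K: (n, d) uint8 keys.
    Returns indices of keys that survive bit-grained pruning.
    """
    n, d = K.shape
    alive = np.arange(n)
    partial = np.zeros(n)                      # exact score of processed bits
    for b in range(bits - 1, -1, -1):          # high-order bits first
        slice_b = (K[alive] >> b) & 1          # bit-slice of remaining keys
        partial[alive] += (1 << b) * (slice_b @ q.astype(np.int64))
        # Upper bound: all unprocessed low-order bits of a key could be 1.
        bound = partial[alive] + ((1 << b) - 1) * q.sum()
        # Keys whose bound falls below the keep-th best partial score
        # can never enter the top-`keep`, so skip their remaining bits.
        thresh = np.sort(partial[alive])[-keep]
        alive = alive[bound >= thresh]
        if len(alive) <= keep:
            break
    return alive
```

Because `partial` is a lower bound and `bound` an upper bound on each key's exact score, the pruning is conservative: the true top-`keep` keys always survive, while most other keys terminate after only a few high-order bit-slices.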
Problem

Research questions and friction points this paper is trying to address.

Reducing LLM inference latency from inefficient GEMM operations
Minimizing weight access inefficiencies through bit-level sparsity
Decreasing KV cache access via bit-grained prediction techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bit-slice repetitiveness reduces redundant GEMM computations
Bit-slice sparsity enables efficient two-state weight coding
Bit-grained progressive prediction minimizes KV cache access
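The intuition behind the first two innovations can be shown with a toy bit-slice decomposition (a hedged sketch: the 4x4 matrix and 4-bit width are illustrative choices, not the paper's configuration):

```python
import numpy as np

# Illustrative weight matrix with mostly small-magnitude values.
W = np.array([[3, 1, 9, 3],
              [1, 3, 3, 1],
              [3, 9, 1, 3],
              [2, 3, 1, 2]], dtype=np.uint8)

# Decompose W into binary bit-slices: W = sum_b 2^b * slice_b.
slices = [(W >> b) & 1 for b in range(4)]

# BSTC intuition: high-order slices of small-magnitude weights are sparse
# (here bit 3 is mostly zero while bit 0 is mostly one), so a two-state
# coding of the near-empty slices cuts weight memory traffic.
density = [float(s.mean()) for s in slices]

# BRCR intuition: identical binary rows recur across slices, so their
# partial products in a GEMM can be computed once and reused.
rows = {tuple(r) for s in slices for r in s}
total_rows = W.shape[0] * len(slices)   # 16 slice rows in total
unique_rows = len(rows)                 # far fewer distinct rows

# Sanity check: the slices losslessly reconstruct W.
W_rebuilt = sum((1 << b) * s for b, s in enumerate(slices))
```

Here only 7 of the 16 slice rows are distinct, and the bit-3 slice is 87.5% zero, which is the kind of redundancy and sparsity the BRCR and BSTC mechanisms exploit at the hardware level.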
👥 Authors
Huizheng Wang, Tsinghua University
Research interests: Sparse Attention, LLM accelerator, AI Infra, Distributed Parallelism, VLSI
Zichuan Wang, Tsinghua University, School of Integrated Circuits
Zhiheng Yue, Tsinghua University, School of Integrated Circuits
Yousheng Long, Tsinghua University, School of Integrated Circuits
Taiquan Wei, Tsinghua University, School of Integrated Circuits
Jianxun Yang, Tsinghua University, School of Integrated Circuits
Yang Wang, Tsinghua University, School of Integrated Circuits
Chao Li, Shanghai Jiao Tong University, Department of Computer Science and Engineering
Shaojun Wei, Professor, Tsinghua University
Yang Hu, Tsinghua University, School of Integrated Circuits
Shouyi Yin, Tsinghua University