🤖 AI Summary
To address the high inference latency of large language models (LLMs) caused by inefficient GEMM computation, weight memory access, and KV cache access, this paper proposes MCBP, a bit-sliced compute–memory co-optimization architecture. Departing from conventional value-level optimizations, it jointly optimizes computation, weight compression, and cache access via three mechanisms: bit-sliced vector redundancy elimination, high-order bit-level sparsity encoding, and bit-wise progressive prediction. At the hardware level, it introduces custom accelerators and memory hierarchies that natively support bit-level operations. Evaluated on 26 benchmarks, the design achieves a 9.43× speedup and 31.1× higher energy efficiency than an NVIDIA A100 GPU, and reduces energy consumption by 35×, 5.2×, and 3.2× versus Spatten, FACT, and SOFA, respectively. Its core contribution is the first unified treatment of fine-grained bit-sliced redundancy and sparsity, addressing a long-standing bottleneck in compute–memory co-optimization for Transformer inference.
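The bit-sliced redundancy idea can be made concrete with a small numerical sketch. The snippet below is a minimal illustration, not the paper's BRCR algorithm or hardware dataflow; the function names and the row-level grouping are assumptions. It decomposes a quantized weight matrix into bit-planes and, within each plane, computes the dot product of each unique binary row only once, reusing the result for all identical rows.

```python
import numpy as np

def bit_planes(w_int, num_bits=8):
    """Split a non-negative integer matrix into binary bit-planes:
    w = sum_b planes[b] * 2**b."""
    return [((w_int >> b) & 1) for b in range(num_bits)]

def gemv_with_slice_reuse(w_int, a, num_bits=8):
    """Compute w_int @ a one bit-plane at a time, sharing the dot product
    of identical bit-plane rows instead of recomputing it per row."""
    y = np.zeros(w_int.shape[0])
    for b, plane in enumerate(bit_planes(w_int, num_bits)):
        uniq, inv = np.unique(plane, axis=0, return_inverse=True)
        partial = uniq @ a                      # one dot product per unique bit pattern
        y += (2.0 ** b) * partial[inv]          # broadcast back to all duplicate rows
    return y

# Quick check against a plain GEMV on small random 4-bit weights.
rng = np.random.default_rng(0)
w = rng.integers(0, 16, size=(64, 32))
a = rng.standard_normal(32)
assert np.allclose(gemv_with_slice_reuse(w, a, num_bits=4), w @ a)
```

The fewer distinct bit patterns a plane contains, the fewer dot products are needed; the paper exploits this kind of repetitiveness at the hardware level rather than in software.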
📝 Abstract
Large language models (LLMs) face significant inference latency due to inefficiencies in GEMM operations, weight access, and KV cache access, especially in real-time scenarios. This highlights the need for a versatile compute-memory efficient accelerator. Unfortunately, existing Transformer accelerators struggle to address compute and memory efficiency simultaneously, as they focus on value-level processing and miss fine-grained opportunities to optimize computation and memory collaboratively. This paper introduces MCBP, a bit-grained compute-memory efficient algorithm-hardware co-design that leverages bit-slice (BS) enabled repetitiveness and sparsity to accelerate LLM inference. MCBP features three key innovations: 1) BS-repetitiveness-enabled computation reduction (BRCR), which eliminates redundant GEMM computation by exploiting the redundancy hidden among BS vectors; 2) BS-sparsity-enabled two-state coding (BSTC), which reduces weight access by exploiting the significant sparsity of high-order bit-slice weights; 3) bit-grained progressive prediction (BGPP), which reduces KV cache access through early-termination-based bit-grained prediction. These techniques, supported by custom accelerator designs, effectively alleviate the burdens of GEMM computation, weight access, and KV cache access. Extensive experiments on 26 benchmarks show that MCBP achieves a 9.43x speedup and 31.1x higher energy efficiency than an Nvidia A100 GPU. Compared with state-of-the-art Transformer accelerators, MCBP delivers 35x, 5.2x, and 3.2x energy savings over Spatten, FACT, and SOFA, respectively.
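As a rough illustration of why high-order bit-slice weights compress well, the sketch below assumes Gaussian-like quantized weights and a toy bitmap-plus-indices coding; the paper's actual two-state coding format is not described in the abstract, so this encoding and all names are illustrative. It measures the per-bit-plane sparsity of a quantized weight tensor and encodes one plane as a row-presence bitmap plus packed nonzero column indices.

```python
import numpy as np

def plane_zero_fraction(w_int, num_bits=8):
    """Fraction of zero bits in each bit-plane of |w|; high-order planes
    of small-magnitude weights are overwhelmingly zero."""
    mag = np.abs(w_int)
    return [float(np.mean(((mag >> b) & 1) == 0)) for b in range(num_bits)]

def encode_plane(plane):
    """Toy two-state encoding of one binary bit-plane: a per-row presence
    bitmap plus, for occupied rows only, the nonzero column indices."""
    bitmap = plane.any(axis=1)
    payload = [np.flatnonzero(row).astype(np.uint16) for row in plane[bitmap]]
    return bitmap, payload

# Gaussian-like 8-bit weights: the top bit-planes are nearly all zero.
rng = np.random.default_rng(0)
w = np.clip(np.round(rng.standard_normal((512, 512)) * 10), -127, 127).astype(np.int32)
print([f"{z:.2f}" for z in plane_zero_fraction(w)])   # zero fraction rises toward bit 7
bitmap, payload = encode_plane((np.abs(w) >> 6) & 1)
print(bitmap.sum(), "rows carry any set bit in plane 6 (often none for such weights)")
```

Skipping or compactly encoding these near-empty high-order planes is the intuition behind reducing weight memory traffic; the concrete format and decoder in MCBP are part of the accelerator design.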