🤖 AI Summary
LLM inference on CPUs faces two key bottlenecks: weak support for low-precision arithmetic, since the optimal bitwidth varies across models and layers, and memory-bandwidth limits during token generation. This paper proposes an SRAM-based in-memory computing architecture built around a lookup table (LUT)-driven approach to enable efficient matrix-vector multiplication (GEMV) at arbitrary bit precision. The design integrates batched LUT lookups, pattern-aware redundancy elimination, in-memory data-type conversion, and parallel dequantization/quantization. It requires only one new instruction and about 2% hardware overhead, and the SRAM arrays continue to serve as ordinary storage, achieving tight coupling of computation and memory. In simulation, the prototype delivers up to 10.7× speedup and 19.9× more tokens per dollar than an ARM Neoverse-N1 CPU baseline, and up to 7.04× better cost efficiency than an NVIDIA V100 GPU.
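To make the LUT-GEMV idea concrete, here is a minimal Python sketch (an illustration of the general technique, not the paper's hardware implementation, and simplified to 1-bit {-1, +1} weights): activations are split into groups, the partial sums for every possible weight pattern of a group are precomputed once into a table, and each output row then resolves its contribution with a single table lookup instead of multiply-accumulates.

```python
import numpy as np

def lut_gemv_1bit(W_bits, x, g=4):
    """LUT-based GEMV sketch for 1-bit weights (bit 0 -> -1, bit 1 -> +1).

    W_bits: (rows, cols) array of 0/1 weight bits.
    x:      (cols,) float activations.
    g:      group size; each lookup covers g activations.
    """
    rows, cols = W_bits.shape
    assert cols % g == 0, "cols must be a multiple of the group size"
    y = np.zeros(rows)
    for gstart in range(0, cols, g):
        xg = x[gstart:gstart + g]
        # Precompute partial sums for all 2^g weight patterns of this
        # activation group; the table is shared by every output row,
        # which is the source of the data reuse.
        lut = np.empty(1 << g)
        for p in range(1 << g):
            signs = np.array([1.0 if (p >> i) & 1 else -1.0 for i in range(g)])
            lut[p] = signs @ xg
        # Each row packs its g weight bits into an index and looks it up.
        for r in range(rows):
            p = 0
            for i in range(g):
                p |= int(W_bits[r, gstart + i]) << i
            y[r] += lut[p]
    return y
```

Building each 2^g-entry table costs work once per activation group, after which every row replaces g multiply-accumulates with one lookup and one add; higher bitwidths are handled by summing scaled bit-plane results in the same way.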
📝 Abstract
Large Language Model (LLM) inference requires substantial computational resources, yet CPU-based inference remains essential for democratizing AI due to the widespread availability of CPUs compared to specialized accelerators. However, efficient LLM inference on CPUs faces two fundamental challenges: (1) existing CPU architectures struggle with the low-precision arithmetic required by quantized models, where optimal bit precision varies across models and layers; and (2) the memory-bound nature of the token generation phase creates severe performance bottlenecks. To address these challenges, we propose SAIL (SRAM-Accelerated Inference of LLMs), a CPU-based inference solution that efficiently supports arbitrary bit precisions with minimal overhead. SAIL integrates three key innovations: First, we introduce Batched LUT-based General Matrix-Vector Multiplication (LUT-GEMV) with SRAM-based processing-in-memory, enabling high data reuse through lookup tables and reducing memory movement. Second, our Pattern-Aware LUT optimization identifies and exploits redundancy in input activation patterns, reducing computation cycles by 13.8%. Third, we develop an in-memory type conversion algorithm that leverages PIM's parallelism for efficient de-/quantization operations, alleviating pressure on the CPU's vector units. Our architecture requires only 2% hardware overhead and a single new instruction, while maintaining dual functionality as both compute and storage units. Experimental evaluations using a modified gem5 simulator demonstrate that SAIL achieves up to 10.7x speedup and 19.9x higher tokens per dollar compared to ARM Neoverse-N1 CPU baselines, and up to 7.04x better cost efficiency than NVIDIA V100 GPUs, establishing a practical path for efficient CPU-based LLM inference.
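The de-/quantization that the abstract offloads to PIM is elementwise and therefore maps naturally onto in-memory parallelism. As a rough sketch of what such a conversion computes (assuming symmetric, per-tensor, round-to-nearest quantization; the paper's exact scheme may differ):

```python
import numpy as np

def quantize_sym(x, bits):
    """Symmetric quantization: map floats to signed `bits`-bit integers.

    Every element is scaled and rounded independently, so all elements
    can be converted in parallel -- the property PIM exploits.
    """
    qmax = (1 << (bits - 1)) - 1
    amax = float(np.max(np.abs(x)))
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def dequantize_sym(q, scale):
    """Inverse mapping: also purely elementwise."""
    return q.astype(np.float32) * scale
```

Because each element's conversion is independent, the round-trip error is bounded by half a quantization step, and no cross-element communication is needed, which is what lets the operation run inside the memory arrays instead of occupying the CPU's vector units.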