LUT-LLM: Efficient Large Language Model Inference with Memory-based Computations on FPGAs

📅 2025-11-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low single-batch inference efficiency of large language models (LLMs) on FPGAs and the poor hardware utilization of arithmetic-intensive operations, this work proposes a memory-centric, lookup-table (LUT)-based inference paradigm: computationally intensive operations are replaced by vector-quantized lookups in on-chip memory, integrated with joint activation-weight quantization and a spatio-temporal hybrid architecture, significantly reducing memory bandwidth and cache pressure. The paper presents the first fully on-chip inference implementation of a >1B-parameter model (Qwen3-1.7B) on an AMD V80 FPGA. Experiments show 1.66x lower latency than an AMD MI210 and 1.72x higher energy efficiency than an NVIDIA A100; scaling to 32B models retains a 2.16x energy-efficiency advantage. Key innovations include an FPGA-aware, low-latency LUT mechanism and a bandwidth-aware parallel centroid search design.

📝 Abstract
The rapid progress of large language models (LLMs) has advanced numerous applications, yet efficient single-batch inference remains vital for on-device intelligence. While FPGAs offer fine-grained data control and high energy efficiency, recent GPU optimizations have narrowed their advantage, especially under arithmetic-based computation. To overcome this, we leverage FPGAs' abundant on-chip memory to shift LLM inference from arithmetic- to memory-based computation through table lookups. We present LUT-LLM, the first FPGA accelerator enabling 1B+ LLM inference via vector-quantized memory operations. Our analysis identifies activation-weight co-quantization as the most effective scheme, supported by (1) bandwidth-aware parallel centroid search, (2) efficient 2D table lookups, and (3) a spatial-temporal hybrid design minimizing data caching. Implemented on an AMD V80 FPGA for a customized Qwen 3 1.7B model, LUT-LLM achieves 1.66x lower latency than AMD MI210 and 1.72x higher energy efficiency than NVIDIA A100, scaling to 32B models with 2.16x efficiency gain over A100.
Problem

Research questions and friction points this paper is trying to address.

Enabling efficient large language model inference on FPGAs using memory-based computations
Overcoming arithmetic computation limitations through lookup table operations
Achieving higher energy efficiency and lower latency than GPU alternatives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses memory-based computation via table lookups
Implements vector-quantized memory operations for LLMs
Employs spatial-temporal hybrid design minimizing data caching
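The core idea behind these contributions can be illustrated in miniature. Under activation-weight co-quantization, each length-G sub-vector of both the weights and the activations is mapped to its nearest centroid, so every partial dot product reduces to a 2D table lookup indexed by the (activation-centroid, weight-centroid) pair. The sketch below is a hypothetical NumPy illustration of that scheme, not the paper's implementation; all sizes, codebooks, and names (`w_books`, `a_books`, `quantize`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

D, N = 64, 32        # input dim, output dim (illustrative sizes)
G, K = 8, 16         # sub-vector (group) length, centroids per codebook
S = D // G           # number of sub-vectors per row

# Hypothetical codebooks: one per sub-vector position, K centroids of length G.
w_books = rng.normal(size=(S, K, G))   # weight codebooks
a_books = rng.normal(size=(S, K, G))   # activation codebooks

def quantize(x, books):
    """Map each length-G sub-vector of each row to its nearest-centroid index
    (the 'centroid search' step; the paper parallelizes this bandwidth-aware)."""
    codes = np.empty((x.shape[0], S), dtype=np.int64)
    for s in range(S):
        sub = x[:, s * G:(s + 1) * G]                     # (rows, G)
        dists = ((sub[:, None, :] - books[s][None]) ** 2).sum(-1)
        codes[:, s] = dists.argmin(1)
    return codes

# Precomputed 2D table: dot product of every (activation, weight) centroid pair.
# table[s, k, l] = <a_books[s, k], w_books[s, l]>
table = np.einsum('skg,slg->skl', a_books, w_books)       # (S, K, K)

W = rng.normal(size=(N, D))
a = rng.normal(size=(1, D))
w_codes = quantize(W, w_books)                            # offline, once
a_codes = quantize(a, a_books)[0]                         # per token

# The matrix-vector product is replaced by pure lookups and accumulation:
# y[n] = sum_s table[s, a_codes[s], w_codes[n, s]]
y = table[np.arange(S), a_codes, w_codes].sum(1)          # (N,)
```

Here `y` equals the exact dot product of the *quantized* weights and activations; no multiplications occur at inference time, only index lookups and additions, which is what makes the computation memory-based and a good fit for abundant FPGA on-chip memory.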
Zifan He
University of California, Los Angeles
FPGA · HPC · Machine Learning
Shengyu Ye
Microsoft Research Asia
Rui Ma
Microsoft Research Asia
Yang Wang
Microsoft Research Asia
Jason Cong
University of California, Los Angeles