PALUTE: Processing-In-Memory Acceleration via Lookup Table for Edge LLM Inference

📅 2026-06-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses edge large language model (LLM) inference, which is constrained by tight power and area budgets and by the significant overhead of dequantization and nonlinear operations in quantized inference. The authors propose a lookup table (LUT)-based processing-in-memory accelerator built on monolithic 3D DRAM. Its vertical architecture enables highly parallel, low-overhead in-memory LUT lookups, complemented by near-memory LUT generation and a cross-hierarchy scheduling strategy to efficiently support mixed-precision GEMM operations. Evaluated on Qwen3-4B with W4A4 quantization, the system achieves a throughput of 1,264 tokens per second at only 0.16 W, delivering 12.8× higher energy efficiency than CHIME, 1.6× higher energy efficiency than FIGLUT, and 2.0× higher area efficiency than PIMPAL.

📝 Abstract

Large language models are increasingly deployed on edge devices with tight power and area budgets. While mixed-precision GEMM reduces arithmetic complexity, quantized inference is often dominated by dequantization and nonlinear operators. Lookup table (LUT)-based methods mitigate these costs by precomputing outputs and replacing repeated arithmetic with table lookups, but existing designs incur significant capacity and lookup-latency overheads. This paper presents PALUTE, a LUT-based Processing-In-Memory (PIM) accelerator built on Monolithic 3D (M3D) DRAM for efficient edge LLM inference. PALUTE enables in-DRAM LUT queries that exploit the vertical organization of M3D DRAM memory array tiles to achieve high parallelism with low area overhead. A near-memory LUT generator supports low-latency LUT generation for both GEMM and element-wise unary nonlinear operators, while a system-level tiering and scheduling strategy minimizes data movement across memory tiers. Evaluation using cycle-accurate simulation and RTL synthesis shows that PALUTE achieves an end-to-end throughput of 1,264 tokens per second (TPS) at 0.16 W, improving energy efficiency by 12.8× over CHIME and 1.6× over FIGLUT, and area efficiency by 2.0× over PIMPAL, under W4A4 quantization on Qwen3-4B.
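The general LUT idea described in the abstract (precompute outputs once, then replace repeated arithmetic with table lookups) can be illustrated with a minimal Python sketch. This is not PALUTE's hardware design or its LUT format. The uniform affine quantization scheme, the choice of SiLU as the nonlinear operator, the per-pair product table, and all function and variable names below are assumptions made purely for illustration.

```python
import math

BITS = 4                  # W4A4: 4-bit codes (assumed unsigned, values 0..15)
LEVELS = 1 << BITS        # 16 possible codes


def silu(x):
    """SiLU activation, used here only as an example unary nonlinear operator."""
    return x / (1.0 + math.exp(-x))


def dequantize(code, scale, zero_point):
    """Uniform affine dequantization (an assumed scheme, not taken from the paper)."""
    return scale * (code - zero_point)


def build_unary_lut(fn, scale, zero_point):
    """Precompute fn(dequantize(code)) once for every possible 4-bit code."""
    return [fn(dequantize(code, scale, zero_point)) for code in range(LEVELS)]


def build_product_lut(zero_point):
    """Precompute every integer product of one weight code and one activation code."""
    return [[(w - zero_point) * (a - zero_point) for a in range(LEVELS)]
            for w in range(LEVELS)]


def lut_dot(weight_codes, act_codes, product_lut):
    """Integer dot product computed with table lookups and additions only."""
    return sum(product_lut[w][a] for w, a in zip(weight_codes, act_codes))


# Hypothetical quantization parameters and data, chosen only for the demo.
scale, zero_point = 0.25, 8
codes = [0, 3, 8, 12, 15]

# Unary operator: one lookup replaces a dequantize plus an exp and a divide.
unary_lut = build_unary_lut(silu, scale, zero_point)
via_lut = [unary_lut[c] for c in codes]
direct = [silu(dequantize(c, scale, zero_point)) for c in codes]

# GEMM inner loop: one lookup per weight/activation pair replaces a multiply.
product_lut = build_product_lut(zero_point)
weights = [1, 15, 8, 4]
acts = [9, 0, 5, 12]
dot_lut = lut_dot(weights, acts, product_lut)
dot_direct = sum((w - zero_point) * (a - zero_point) for w, a in zip(weights, acts))
```

In this sketch the unary table has 16 entries and the product table has 256, regardless of tensor size, so the cost of building them is amortized over every element that reuses them. The table sizes and how tables are generated and placed in a real accelerator are design choices this example does not attempt to model.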

Problem

Research questions and friction points this paper is trying to address.

edge LLM inference

lookup table

dequantization

nonlinear operators

memory overhead

Innovation

Methods, ideas, or system contributions that make the work stand out.

Processing-in-Memory

Lookup Table (LUT)

Monolithic 3D DRAM