LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving

πŸ“… 2025-09-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the low dequantization efficiency of W4A8-quantized GEMM on CUDA cores, which cannot keep pace with Tensor Core throughput during large language model inference, this paper proposes LiquidGEMM, a hardware-efficient kernel. The approach introduces three key innovations: (1) LiquidQuant, a quantization scheme enabling safe, low-overhead dequantization of four weights in just two instructions; (2) an implicit fine-grained pipeline that fully overlaps weight loading, dequantization, and matrix multiply-accumulate (MMA) operations, eliminating software synchronization overhead; and (3) co-optimization of CUDA cores and Tensor Cores through implicit inter-warp-group pipelined scheduling. Experiments show that LiquidGEMM achieves up to 2.90Γ— speedup over state-of-the-art W4A8 kernels and up to 4.94Γ— end-to-end inference acceleration. Integrated into TensorRT-LLM, it yields 1.12–1.63Γ— performance improvements across diverse models.
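For context, a conventional CUDA-core dequantization path spends several shift, mask, and subtract operations per weight, which is the bottleneck this summary refers to. The sketch below is a generic INT4-to-INT8 unpack routine under that conventional scheme; it is not LiquidQuant itself (whose two-instruction trick is not reproduced here), and the function name and zero-point choice are illustrative assumptions.

```cuda
#include <cstdint>

// Generic (non-LiquidQuant) dequantization of eight packed 4-bit weights into
// signed 8-bit values. Each weight is stored as an unsigned nibble with an
// assumed zero point of 8; extracting one costs a shift, a mask, and a subtract,
// i.e. roughly three instructions per element versus the paper's two
// instructions per four elements.
__device__ __forceinline__ void unpack_w4_to_s8(uint32_t packed, int8_t out[8]) {
#pragma unroll
    for (int i = 0; i < 8; ++i) {
        uint32_t nibble = (packed >> (4 * i)) & 0xFu;                 // isolate one 4-bit weight
        out[i] = static_cast<int8_t>(static_cast<int>(nibble) - 8);   // remove the zero point
    }
}
```

Keeping this per-element cost off the critical path is what allows Tensor Core MMA throughput to be sustained.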

πŸ“ Abstract
Quantization is a critical technique for accelerating LLM inference by reducing memory footprint and improving computational efficiency. Among various schemes, 4-bit weight and 8-bit activation quantization (W4A8) offers a strong balance between accuracy and performance. However, existing W4A8 GEMM kernels fall short in practice due to inefficient dequantization on CUDA Cores, which cannot keep pace with the high throughput of Tensor Cores. In this paper, we present LiquidGEMM, a hardware-efficient W4A8 GEMM kernel for efficient LLM serving. LiquidGEMM designs two key techniques: LiquidQuant, a hardware-efficient quantization method that enables fast, overflow-safe dequantization using just two arithmetic instructions per four elements; and an implicit fine-grained pipeline that fully overlaps weight loading, dequantization, and MMA across warp groups without software synchronization or redundant memory traffic. Experimental results show that LiquidGEMM achieves up to 2.90x speedup over state-of-the-art W4A8 kernels and up to 4.94x end-to-end system-level speedup. Compared to various quantized GEMM kernels in NVIDIA TensorRT-LLM, LiquidGEMM delivers 1.12-1.63x performance gains, and achieves up to 1.63x system-level speedup.
Problem

Research questions and friction points this paper is trying to address.

Optimizing W4A8 GEMM kernels for efficient LLM inference
Overcoming inefficient dequantization limitations on CUDA Cores
Enhancing hardware efficiency for high-performance LLM serving
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hardware-efficient W4A8 GEMM kernel design
LiquidQuant, a fast, overflow-safe dequantization method
Implicit fine-grained pipeline overlapping weight loading, dequantization, and MMA (see the sketch below)
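The sketch below illustrates, in simplified scalar form, the kind of overlap the last item refers to: the next packed weight word is fetched while the current one is dequantized and multiplied against INT8 activations. It uses a plain integer dot product in place of Tensor Core MMA instructions, and the helper names and zero point are hypothetical, so this is a structural illustration rather than the paper's implicit inter-warp-group pipeline.

```cuda
#include <cstdint>

// Dequantize one nibble (assumed zero point 8) from a packed 32-bit word.
__device__ __forceinline__ int32_t dequant_nibble(uint32_t packed, int i) {
    return static_cast<int32_t>((packed >> (4 * i)) & 0xFu) - 8;
}

// Software-pipelined dot product of one row of packed 4-bit weights with INT8
// activations: the load of the next packed word overlaps with dequantization
// and multiply-accumulate of the current word. K is assumed to be a multiple of 8.
__device__ int32_t w4a8_row_dot(const uint32_t* __restrict__ w_packed,
                                const int8_t*  __restrict__ act, int K) {
    int32_t acc = 0;
    uint32_t cur = w_packed[0];                                   // prime the pipeline
    for (int k = 0; k < K; k += 8) {
        uint32_t nxt = (k + 8 < K) ? w_packed[(k + 8) / 8] : 0u;  // prefetch next packed word
        #pragma unroll
        for (int i = 0; i < 8; ++i)
            acc += dequant_nibble(cur, i) * static_cast<int32_t>(act[k + i]);
        cur = nxt;                                                // rotate the double buffer
    }
    return acc;
}
```

In the actual kernel, the equivalent overlap is achieved across warp groups and between CUDA cores and Tensor Cores, without software synchronization.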
Authors
Huanqi Hu, Shanghai Jiao Tong University
Bowen Xiao, ByteDance Seed
Shixuan Sun, Shanghai Jiao Tong University
Jianian Yin, ByteDance Seed
Zhexi Zhang, ByteDance Seed
Xiang Luo, Nanjing University
Chengquan Jiang, ByteDance Seed
Weiqi Xu, ByteDance Seed
Xiaoying Jia, ByteDance Seed
Xin Liu, ByteDance Seed
Minyi Guo, IEEE Fellow, Chair Professor, Shanghai Jiao Tong University