M-ANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type

📅 2025-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper targets two key challenges in deploying large language models (LLMs) under low-bit quantization: accuracy degradation caused by the heterogeneous distributions of weights and dynamically generated key-value (KV) caches, and hardware inefficiency arising from real-time quantization scheduling. To address them, it proposes the Mathematically Adaptive Numerical Type (M-ANT), the first *modelable*, group-level adaptive numerical encoding paradigm, enabling fine-grained, unified quantization of both weights and dynamic KV caches. The paper co-designs a real-time quantization scheduling mechanism with a customized processing-element (PE) and systolic-array hardware architecture. Experimental results show that, compared to the state-of-the-art LLM accelerator, M-ANT achieves an average 2.99× speedup (up to 4.46×) and a 2.81× energy-efficiency improvement (up to 4.10×) while preserving model accuracy under aggressive low-bit quantization.

📝 Abstract
Large language models (LLMs) are among today's most important computer applications. Recent algorithmic advances propose fine-grained group-wise quantization for LLMs, which treats a small set (e.g., 64) of values in a tensor as a compression unit. It effectively preserves model accuracy without retraining and has become the standard approach for efficiently deploying LLMs. Meanwhile, other works propose various adaptive data types to better fit different distributions and further reduce the required bit length for LLMs. In this work, our detailed analysis unveils a key finding: while different tensors exhibit similar distributions, small groups can have markedly different distributions. This group-level diversity requires a new level of adaptivity that existing adaptive data types fail to provide. In this paper, we propose MANT, a mathematically adaptive numeric type, featuring a more flexible encoding paradigm that covers a wider range of data distributions and a more efficient decoding-computation fusion mechanism to address these challenges. Based on MANT, we develop a supporting framework to assign the appropriate data type to each group adaptively. Meanwhile, the dynamically generated key-value (KV) caches in LLMs introduce further complexity for real-time quantization. To tackle this, we propose an efficient real-time quantization mechanism. In addition, we implement a specific processing element (PE) to efficiently support MANT and incorporate a real-time quantization unit. By integrating these components into a systolic array, MANT unifies group-wise weight and KV cache quantization and addresses the associated challenges. Our evaluation shows that MANT achieves, on average, a 2.99x (up to 4.46x) speedup and a 2.81x (up to 4.10x) energy reduction over the state-of-the-art LLM accelerator.
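The group-wise scheme the abstract describes can be illustrated with a minimal sketch. This is not the paper's MANT encoding; it is a plain symmetric 4-bit baseline, assuming one scale per group of 64 values, to show why per-group parameters help when groups have very different dynamic ranges. All function names here are illustrative.

```python
# Illustrative sketch of group-wise quantization (not the MANT data type):
# each group of 64 values is an independent compression unit with its own scale.
import numpy as np

GROUP_SIZE = 64  # group size from the abstract's example


def quantize_groupwise(x, bits=4, group_size=GROUP_SIZE):
    """Quantize a 1-D float tensor group by group.

    Returns signed integer codes and one scale per group.
    """
    qmax = 2 ** (bits - 1) - 1  # e.g., 7 for signed 4-bit
    groups = x.reshape(-1, group_size)
    # Per-group scale: the group's own max magnitude sets its step size,
    # so an outlier in one group does not blow up the error of another.
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero groups
    codes = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return codes, scales


def dequantize_groupwise(codes, scales):
    """Reconstruct the float tensor from codes and per-group scales."""
    return (codes.astype(np.float32) * scales).reshape(-1)


np.random.seed(0)
x = np.random.randn(256).astype(np.float32)
codes, scales = quantize_groupwise(x)
x_hat = dequantize_groupwise(codes, scales)
# Rounding error is bounded by half a step (scale / 2) within each group.
```

The per-group scales are exactly the adaptivity lever the paper generalizes: MANT goes further by also choosing the numeric encoding per group, since a single scale factor cannot capture the group-level distribution diversity the authors observe.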
Problem

Research questions and friction points this paper is trying to address.

Adaptive data types for LLM quantization
Real-time quantization for dynamic KV caches
Efficient group-wise weight and cache compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mathematically adaptive numeric type
Efficient real-time quantization mechanism
Systolic array integration for quantization
Weiming Hu
Shanghai Jiao Tong University, Shanghai Qi Zhi Institute
Haoyan Zhang
Shanghai Jiao Tong University
Computer Architecture
Cong Guo
Duke University
Yu Feng
Shanghai Jiao Tong University, Shanghai Qi Zhi Institute
Renyang Guan
Shanghai Jiao Tong University, Shanghai Qi Zhi Institute
Zhendong Hua
Shanghai Jiao Tong University, Shanghai Qi Zhi Institute
Zihan Liu
Shanghai Jiao Tong University, Shanghai Qi Zhi Institute
Yue Guan
University of California, San Diego
Model Compression, ML System
Minyi Guo
IEEE Fellow, Chair Professor, Shanghai Jiao Tong University
Parallel Computing, Compiler Optimization, Cloud Computing, Networking, Big Data
Jingwen Leng
Professor, Shanghai Jiao Tong University
Computer Architecture