The Power of Negative Zero: Datatype Customization for Quantized Large Language Models

πŸ“… 2025-01-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the limited numerical representability of low-bit floating-point formats and the accuracy–efficiency trade-off in quantizing large language models (LLMs), this paper proposes Redundant Zero Remapping (RaZeR). RaZeR is the first method to exploit the redundant negative-zero encoding in IEEE 754-style sign-magnitude floating-point formats, remapping it to predefined special values so that FP3/FP4 better fits the numerical distributions of LLM weights and KV-cache activations. Combined with customized floating-point encoding, joint quantization of weights and KV-cache, compatibility with clipping- and transformation-based quantization algorithms, and a bit-operation-optimized fused GEMV kernel that dequantizes RaZeR to FP16, RaZeR achieves up to 7.56× higher GEMV throughput than FP16 on modern GPUs and up to 2.72× higher decoding throughput, while surpassing asymmetric INT quantization in accuracy.
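The core idea above can be sketched in a few lines: a 4-bit sign-magnitude FP4 (E2M1) format encodes both +0 (`0b0000`) and -0 (`0b1000`), wasting one of only 16 codes; RaZeR reassigns the negative-zero code to a special value. The sketch below assumes a bias-1 exponent and uses 8.0 as a placeholder special value (the paper selects special values to fit the actual weight distribution, and this decoder is an illustration, not the paper's kernel).

```python
# Illustrative FP4 (E2M1, bias 1) decoder with RaZeR-style remapping
# of the redundant negative-zero code to a special value.

def decode_fp4(code: int, special_value: float = 8.0) -> float:
    """Decode a 4-bit sign-magnitude FP4 code to a Python float.

    Code 0b1000 would normally be negative zero (redundant with +0);
    here it is remapped to `special_value` instead, so all 16 codes
    decode to distinct values. The choice of 8.0 is hypothetical.
    """
    assert 0 <= code <= 0b1111
    if code == 0b1000:               # redundant negative zero -> special value
        return special_value
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11         # 2 exponent bits
    man = code & 0b1                 # 1 mantissa bit
    if exp == 0:                     # subnormal: mantissa * 2^(1 - bias)
        mag = man * 0.5
    else:                            # normal: (1 + mantissa/2) * 2^(exp - bias)
        mag = (1.0 + 0.5 * man) * 2.0 ** (exp - 1)
    return sign * mag

# With the remap, the 16 codes cover 16 distinct quantization levels
# instead of 15 (the standard FP4 grid plus one extra special value).
levels = [decode_fp4(c) for c in range(16)]
```

The extra level is what lets RaZeR match or beat asymmetric INT4, which also uses all 16 codes but on a uniform grid that fits bell-shaped LLM weight distributions less well.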

πŸ“ Abstract
Large language models (LLMs) have demonstrated remarkable performance across various machine learning tasks, quickly becoming one of the most prevalent AI workloads. Yet the substantial memory requirement of LLMs significantly hinders their deployment for end users. Post-training quantization (PTQ) serves as one of the most hardware-efficient methods to mitigate the memory and computational demands of LLMs. Although the traditional integer (INT) datatype has received widespread adoption in PTQ methods, floating-point (FP) quantization has emerged as a viable alternative thanks to its effectiveness in fitting LLM numerical distributions. However, the FP datatype in sign-magnitude binary representation contains both positive and negative zero, which constrains its representation capability, particularly under low precision (3 and 4 bits). In this paper, we extend the basic FP datatype to perform Redundant Zero Remapping (RaZeR), which remaps the negative zero FP encoding to a set of pre-defined special values to maximally utilize FP quantization encodings and to better fit LLM numerical distributions. Through careful selection of special values, RaZeR outperforms conventional asymmetric INT quantization while achieving high computational efficiency. We demonstrate that RaZeR can be seamlessly integrated with quantization algorithms for both weights and KV-cache, including advanced methods with clipping and transformations, and consistently achieve better model accuracy. Additionally, we implement a fast GEMV kernel with fused dequantization that efficiently converts the 4-bit RaZeR value to FP16 through novel bit-level manipulation. On modern GPUs, our evaluation shows that RaZeR improves the GEMV speed by up to 7.56× compared to the FP16 implementation, while achieving up to 2.72× speedup in the LLM decoding throughput.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Memory Optimization
Computational Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

RaZeR
Floating-point Optimization
Quantization Improvement
πŸ”Ž Similar Papers
No similar papers found.