🤖 AI Summary
To address high latency and excessive memory consumption in large language model (LLM) inference, this paper proposes W4A8, a hardware-friendly dual-precision quantization scheme: weights are stored as 4-bit integers, while activations and computations use 8-bit floating-point (FP8) arithmetic, jointly optimizing storage efficiency and computational performance without compromising accuracy. The authors introduce Dual Precision Quantization (DPQ), an overhead-free quantization algorithm that uses hardware-aware post-training calibration to avoid the latency penalty typically incurred by conventional accuracy-compensation techniques. DPQ requires no hardware modifications and is compatible with a range of modern AI accelerators. Experimental results show 35–62% higher inference throughput (up to a 1.8× speedup), a 75% reduction in weight memory footprint, and less than 0.5% accuracy degradation relative to FP16 baselines.
📝 Abstract
Deep neural networks have achieved state-of-the-art results in a wide range of applications, from natural language processing and computer vision to speech recognition. However, as tasks become increasingly complex, model sizes continue to grow, posing challenges in latency and memory efficiency. To meet these constraints, post-training quantization has emerged as a promising solution. In this paper, we propose a novel hardware-efficient quantization and inference scheme that exploits hardware advantages with minimal accuracy degradation. Specifically, we introduce a W4A8 scheme, where weights are quantized and stored in 4-bit integer precision and inference computations are performed in 8-bit floating-point arithmetic, yielding significant speedups and improved memory utilization compared to 16-bit operations across a variety of modern accelerators. To mitigate accuracy loss, we develop a novel quantization algorithm, dubbed Dual Precision Quantization (DPQ), that leverages the unique structure of our scheme without introducing additional inference overhead. Experimental results demonstrate improved performance (i.e., increased throughput) while incurring only minor accuracy degradation relative to the full-precision model.
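To make the W4A8 idea concrete, the following is a minimal sketch of the weight-storage half of such a scheme: symmetric per-tensor quantization of weights to signed 4-bit integers with a floating-point scale. This is an illustrative assumption, not the paper's DPQ algorithm; on real accelerators the dequantized values would be consumed as FP8 operands rather than Python floats, and production schemes typically use finer-grained (e.g., per-channel or per-group) scales.

```python
# Illustrative W4 weight quantization (assumed scheme, not the paper's DPQ).
# Weights are stored as signed 4-bit integers in [-8, 7] plus one FP scale;
# at inference they would be dequantized on the fly and fed to FP8 compute.

def quantize_int4(weights):
    """Symmetric quantization to signed 4-bit integers [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q, scale):
    """Recover approximate weight values (FP8 on real hardware)."""
    return [qi * scale for qi in q]

weights = [0.31, -0.92, 0.05, 0.77, -0.40]
q, scale = quantize_int4(weights)
recon = dequantize_int4(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, recon))

# Storing 4 bits per weight instead of 16 (FP16) gives the 75% weight
# memory reduction cited above; max_err is bounded by scale / 2.
```

Rounding to the nearest of 15 signed levels bounds the per-weight error by half a quantization step, which is why post-training calibration of the scales (as DPQ performs) is the main lever for limiting accuracy loss.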