Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high latency and excessive memory consumption in large language model (LLM) inference, this paper proposes W4A8, a hardware-friendly dual-precision quantization paradigm: weights are stored as 4-bit integers, while activations and computations use 8-bit floating-point (FP8) arithmetic, jointly optimizing storage efficiency and computational performance without compromising accuracy. The authors introduce Dual Precision Quantization (DPQ), an overhead-free quantization algorithm that employs hardware-aware post-training calibration to eliminate the latency overhead typically incurred by conventional accuracy-compensation techniques. DPQ requires no hardware modifications and is compatible with diverse modern AI accelerators. Experimental results demonstrate 35–62% higher inference throughput (up to 1.8× speedup), a 75% reduction in weight memory footprint, and less than 0.5% accuracy degradation relative to FP16 baselines.

📝 Abstract
Deep neural networks have achieved state-of-the-art results in a wide range of applications, from natural language processing and computer vision to speech recognition. However, as tasks become increasingly complex, model sizes continue to grow, posing challenges in latency and memory efficiency. To meet these constraints, post-training quantization has emerged as a promising solution. In this paper, we propose a novel hardware-efficient quantization and inference scheme that exploits hardware advantages with minimal accuracy degradation. Specifically, we introduce a W4A8 scheme, where weights are quantized and stored using 4-bit integer precision, and inference computations are performed using 8-bit floating-point arithmetic, demonstrating significant speedups and improved memory utilization compared to 16-bit operations, applicable to various modern accelerators. To mitigate accuracy loss, we develop a novel quantization algorithm, dubbed Dual Precision Quantization (DPQ), that leverages the unique structure of our scheme without introducing additional inference overhead. Experimental results demonstrate improved performance (i.e., increased throughput) while maintaining tolerable accuracy degradation relative to the full-precision model.
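The W4A8 idea described above (INT4 weight storage, FP8 compute) can be illustrated with a small simulation. This is a hedged sketch, not the paper's DPQ algorithm: it uses generic per-channel symmetric INT4 quantization and a crude mantissa-rounding approximation of FP8 (E4M3), both of which are assumptions introduced here for illustration only.

```python
import numpy as np

def quantize_weights_int4(w, axis=0):
    """Per-channel symmetric INT4 quantization (values in [-8, 7])."""
    scale = np.max(np.abs(w), axis=axis, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero channels
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # 4-bit codes stored in int8
    return q, scale

def simulate_fp8_e4m3(x):
    """Crude FP8 E4M3 simulation: clamp to ±448 and keep ~3 mantissa bits."""
    x = np.clip(x, -448.0, 448.0)
    mant, exp = np.frexp(x)                      # mant in [0.5, 1)
    return np.ldexp(np.round(mant * 16) / 16, exp)  # round mantissa to 16 steps

# Toy W4A8-style matmul: dequantize INT4 weights, run in simulated FP8.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32)).astype(np.float32)
x = rng.standard_normal((8, 64)).astype(np.float32)

q, scale = quantize_weights_int4(w, axis=0)
w_deq = q.astype(np.float32) * scale             # INT4 -> higher precision
y = simulate_fp8_e4m3(x) @ simulate_fp8_e4m3(w_deq)

err = np.linalg.norm(y - x @ w) / np.linalg.norm(x @ w)
print(f"relative error vs full precision: {err:.3f}")
```

The quantization error stays small on this toy example; the paper's contribution is a calibration scheme (DPQ) that keeps such degradation low at LLM scale without adding inference overhead.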
Problem

Research questions and friction points this paper is trying to address.

Reducing model size and latency for efficient DNN inference
Minimizing accuracy loss during low-bit quantization
Enhancing hardware utilization with mixed-precision computation
Innovation

Methods, ideas, or system contributions that make the work stand out.

W4A8 scheme for efficient quantization
Dual Precision Quantization algorithm
Hardware-efficient inference with minimal accuracy loss
Tomer Gafni (Intel, Israel)
A. Karnieli (Intel, Israel)
Yair Hanani (unknown affiliation)