🤖 AI Summary
To address high latency and excessive memory consumption in large language model (LLM) inference, this paper proposes W4A8, a hardware-friendly dual-precision quantization scheme: weights are stored as 4-bit integers, while activations and computations use 8-bit floating-point (FP8) arithmetic, jointly optimizing storage efficiency and computational performance without compromising accuracy. The authors introduce Dual Precision Quantization (DPQ), an overhead-free quantization algorithm that uses hardware-aware post-training calibration to avoid the latency penalty typically incurred by conventional accuracy-compensation techniques. DPQ requires no hardware modifications and is compatible with a range of modern AI accelerators. Experimental results show 35–62% higher inference throughput (up to a 1.8× speedup), a 75% reduction in weight memory footprint, and less than 0.5% accuracy degradation relative to FP16 baselines.
📝 Abstract
Deep neural networks have achieved state-of-the-art results in a wide range of applications, from natural language processing and computer vision to speech recognition. However, as tasks become increasingly complex, model sizes continue to grow, posing challenges in latency and memory efficiency. To meet these constraints, post-training quantization has emerged as a promising solution. In this paper, we propose a novel hardware-efficient quantization and inference scheme that exploits hardware advantages with minimal accuracy degradation. Specifically, we introduce a W4A8 scheme, where weights are quantized and stored in 4-bit integer precision and inference computations are performed in 8-bit floating-point arithmetic, yielding significant speedups and improved memory utilization compared to 16-bit operations across a variety of modern accelerators. To mitigate accuracy loss, we develop a novel quantization algorithm, dubbed Dual Precision Quantization (DPQ), that leverages the unique structure of our scheme without introducing additional inference overhead. Experimental results demonstrate improved performance (i.e., increased throughput) while incurring only minor accuracy degradation relative to the full-precision model.
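To make the W4A8 idea concrete, the following is a minimal sketch of the weight-storage half of such a scheme: symmetric per-tensor quantization of weights to signed 4-bit integers with a floating-point scale. This is an illustrative assumption, not the paper's DPQ algorithm; on real accelerators the dequantized values would be consumed as FP8 operands rather than Python floats, and production schemes typically use finer-grained (e.g., per-channel or per-group) scales.

```python
# Illustrative W4 weight quantization (assumed scheme, not the paper's DPQ).
# Weights are stored as signed 4-bit integers in [-8, 7] plus one FP scale;
# at inference they would be dequantized on the fly and fed to FP8 compute.

def quantize_int4(weights):
    """Symmetric quantization to signed 4-bit integers [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q, scale):
    """Recover approximate weight values (FP8 on real hardware)."""
    return [qi * scale for qi in q]

weights = [0.31, -0.92, 0.05, 0.77, -0.40]
q, scale = quantize_int4(weights)
recon = dequantize_int4(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, recon))

# Storing 4 bits per weight instead of 16 (FP16) gives the 75% weight
# memory reduction cited above; max_err is bounded by scale / 2.
```

Rounding to the nearest of 15 signed levels bounds the per-weight error by half a quantization step, which is why post-training calibration of the scales (as DPQ performs) is the main lever for limiting accuracy loss.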