Understanding the Difficulty of Low-Precision Post-Training Quantization for LLMs

📅 2024-10-18

📈 Citations: 0

✨ Influential: 0

career value

171K/year

🤖 AI Summary

This work investigates the fundamental cause of severe performance degradation in post-training quantization (PTQ) of large language models (LLMs) at ultra-low precisions (≤4 bits). We identify a structural objective mismatch: PTQ minimizes layer-wise local quantization error, which diverges from global task-loss optimization—especially under limited calibration data—leading to significantly inferior results compared to quantization-aware fine-tuning (QAT). To systematically analyze this gap, we propose a unified evaluation framework integrating inter-layer error analysis, gradient sensitivity diagnosis, and multi-granularity quantization experiments, applied across LLaMA-2 and Phi-3. Empirical results show that PTQ suffers >15 BLEU/ACC point drops at 2–4 bits, whereas QAT preserves over 95% of original performance. This study is the first to formally attribute PTQ’s intrinsic limitation at ultra-low bitwidths to objective inconsistency, thereby establishing QAT’s irreplaceable role in efficient LLM compression.

Technology Category

Application Category

📝 Abstract

Large language models of high parameter counts are computationally expensive, yet can be made much more efficient by compressing their weights to very low numerical precision. This can be achieved either through post-training quantization by minimizing local, layer-wise quantization errors, or through quantization-aware fine-tuning by minimizing the global loss function. In this study, we discovered that, under the same data constraint, the former approach nearly always fared worse than the latter, a phenomenon particularly prominent when the numerical precision is very low. We further showed that this difficulty of post-training quantization arose from stark misalignment between optimization of the local and global objective functions. Our findings explains limited utility in minimization of local quantization error and the importance of direct quantization-aware fine-tuning, in the regime of large models at very low precision.

Problem

Research questions and friction points this paper is trying to address.

Analyzing challenges in low-precision post-training quantization for LLMs

Comparing local vs global optimization in weight compression

Explaining poor performance of layer-wise quantization at very low precision

Innovation

Methods, ideas, or system contributions that make the work stand out.

Post-training quantization minimizes layer-wise errors

Quantization-aware fine-tuning optimizes global loss

Local and global objective functions misaligned

🔎 Similar Papers

No similar papers found.