🤖 AI Summary
To address the significant accuracy degradation in 4-bit post-training quantization of large language models (LLMs) caused by extreme outliers in activation tensors, this paper proposes ResQ, a low-rank residual mixed-precision quantization method. The approach comprises three key components: (i) a PCA-based mechanism that preserves a low-rank, high-fidelity subspace—retaining only 1/8 of the hidden dimension—which the authors prove yields the optimal mixed-precision solution minimizing quantization error; (ii) invariant random rotation to suppress the impact of outliers; and (iii) joint 4-bit primary quantization with 8-bit residual compensation in the low-rank subspace. Evaluated on the Llama and Qwen2.5 model families, ResQ reduces Wikitext perplexity by up to 33% over state-of-the-art methods (e.g., SpinQuant) and achieves up to 3× inference speedup relative to the FP16 baseline, without compromising model generalization.
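Component (i) can be illustrated with a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the paper's released code: it assumes calibration activations `X` of shape `(num_tokens, hidden_dim)` and fixes the retained rank at 1/8 of the hidden dimension as described above; the function name `pca_subspace` and the `rank_fraction` parameter are illustrative.

```python
# Minimal sketch of the PCA-based subspace selection, assuming calibration
# activations X of shape (num_tokens, hidden_dim). Names are illustrative,
# not the paper's API.
import torch

def pca_subspace(X: torch.Tensor, rank_fraction: int = 8) -> torch.Tensor:
    """Return an orthonormal basis (hidden_dim x r) spanning the directions
    of highest activation variance, with r = hidden_dim // rank_fraction."""
    num_tokens, hidden_dim = X.shape
    r = hidden_dim // rank_fraction
    X = X - X.mean(dim=0, keepdim=True)       # center the activations
    cov = (X.T @ X) / (num_tokens - 1)        # sample covariance, d x d
    # eigh returns eigenvalues in ascending order; keep the top-r eigenvectors
    _, eigvecs = torch.linalg.eigh(cov)
    return eigvecs[:, -r:]
```

Coefficients along these top-variance directions are the ones kept in high precision; everything in the orthogonal complement is quantized aggressively.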
📝 Abstract
Post-training quantization (PTQ) of large language models (LLMs) holds promise for reducing the prohibitive computational cost at inference time. Quantizing all weight, activation, and key-value (KV) cache tensors to 4-bit without significantly degrading generalizability is challenging, due to the high quantization error caused by extreme outliers in activations. To tackle this problem, we propose ResQ, a PTQ method that pushes the state of the art further. By means of principal component analysis (PCA), it identifies a low-rank subspace (in practice 1/8 of the hidden dimension) in which activation variances are highest, and keeps the coefficients within this subspace in high precision, e.g., 8-bit, while quantizing the rest to 4-bit. Within each subspace, invariant random rotation is applied to further suppress outliers. We show that this is a provably optimal mixed-precision quantization scheme that minimizes error. With the Llama and Qwen2.5 families of models, we demonstrate that ResQ outperforms recent uniform- and mixed-precision PTQ methods on a variety of benchmarks, achieving up to 33% lower perplexity on Wikitext than the next best method, SpinQuant, and up to 3× speedup over the 16-bit baseline. Code is available at https://github.com/utkarsh-dmx/project-resq.
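To make the quantization scheme concrete, the following hedged sketch combines the pieces the abstract describes: project activations onto the high-variance subspace and its orthogonal complement, apply a random orthogonal rotation within each subspace to flatten outliers, then quantize the subspace coefficients to 8-bit and the remainder to 4-bit. All function names are illustrative assumptions, and the per-tensor symmetric fake-quantizer is a simplification of whatever quantizer the paper actually uses.

```python
# Hedged sketch of the mixed-precision step: 8-bit inside the high-variance
# subspace, 4-bit outside, with a random rotation applied within each
# subspace. Names are illustrative, not the paper's API.
import torch

def random_rotation(n: int, seed: int = 0) -> torch.Tensor:
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix.
    g = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(n, n, generator=g))
    return q

def quantize_sym(x: torch.Tensor, bits: int) -> torch.Tensor:
    # Uniform symmetric fake-quantization (quantize, then dequantize).
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

def mixed_precision_quant(X: torch.Tensor, U_hi: torch.Tensor) -> torch.Tensor:
    """X: activations (num_tokens, d); U_hi: orthonormal basis (d, r)
    of the high-variance subspace, e.g. from the PCA sketch above."""
    d, r = U_hi.shape
    # Complete U_hi to a full orthonormal basis; the trailing columns span
    # the orthogonal complement (the low-precision part).
    U_full, _ = torch.linalg.qr(torch.cat([U_hi, torch.randn(d, d - r)], dim=1))
    U_lo = U_full[:, r:]
    R_hi, R_lo = random_rotation(r), random_rotation(d - r, seed=1)
    hi = quantize_sym(X @ U_hi @ R_hi, bits=8)   # high-fidelity subspace
    lo = quantize_sym(X @ U_lo @ R_lo, bits=4)   # the remaining 7/8
    # Projections and rotations are orthogonal, so they invert by transpose.
    return hi @ R_hi.T @ U_hi.T + lo @ R_lo.T @ U_lo.T
```

Because the rotations are orthogonal, they change neither the subspace split nor the reconstruction, only the coordinate system in which quantization happens, which is what makes them "invariant" in the sense used above.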