🤖 AI Summary
To address the significant accuracy degradation in 4-bit post-training quantization of large language models (LLMs) caused by extreme outliers in activation tensors, this paper proposes ResQ, a low-rank residual mixed-precision quantization method. The approach comprises three key components: (i) a PCA-based mechanism that preserves a low-rank, high-fidelity subspace—retaining only 1/8 of the hidden dimension—which the authors prove yields the optimal mixed-precision solution minimizing quantization error; (ii) invariant random rotation to suppress the impact of outliers; and (iii) joint 4-bit primary quantization with 8-bit residual compensation in the low-rank subspace. Evaluated on the Llama and Qwen2.5 model families, ResQ reduces Wikitext perplexity by up to 33% over state-of-the-art methods (e.g., SpinQuant) and achieves up to 3× inference speedup relative to the FP16 baseline, without compromising model generalization.
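Component (i) can be illustrated with a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the paper's released code: it assumes calibration activations `X` of shape `(num_tokens, hidden_dim)` and fixes the retained rank at 1/8 of the hidden dimension as described above; the function name `pca_subspace` and the `rank_fraction` parameter are illustrative.

```python
# Minimal sketch of the PCA-based subspace selection, assuming calibration
# activations X of shape (num_tokens, hidden_dim). Names are illustrative,
# not the paper's API.
import torch

def pca_subspace(X: torch.Tensor, rank_fraction: int = 8) -> torch.Tensor:
    """Return an orthonormal basis (hidden_dim x r) spanning the directions
    of highest activation variance, with r = hidden_dim // rank_fraction."""
    num_tokens, hidden_dim = X.shape
    r = hidden_dim // rank_fraction
    X = X - X.mean(dim=0, keepdim=True)       # center the activations
    cov = (X.T @ X) / (num_tokens - 1)        # sample covariance, d x d
    # eigh returns eigenvalues in ascending order; keep the top-r eigenvectors
    _, eigvecs = torch.linalg.eigh(cov)
    return eigvecs[:, -r:]
```

Coefficients along these top-variance directions are the ones kept in high precision; everything in the orthogonal complement is quantized aggressively.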
📝 Abstract
Post-training quantization (PTQ) of large language models (LLMs) holds promise for reducing the prohibitive computational cost at inference time. Quantizing all weight, activation, and key-value (KV) cache tensors to 4-bit without significantly degrading generalizability is challenging, due to the high quantization error caused by extreme outliers in activations. To tackle this problem, we propose ResQ, a PTQ method that pushes the state of the art further. By means of principal component analysis (PCA), it identifies a low-rank subspace (in practice 1/8 of the hidden dimension) in which activation variances are highest, and keeps the coefficients within this subspace in high precision, e.g., 8-bit, while quantizing the rest to 4-bit. Within each subspace, invariant random rotation is applied to further suppress outliers. We show that this is a provably optimal mixed-precision quantization scheme that minimizes error. With the Llama and Qwen2.5 families of models, we demonstrate that ResQ outperforms recent uniform- and mixed-precision PTQ methods on a variety of benchmarks, achieving up to 33% lower perplexity on Wikitext than the next best method, SpinQuant, and up to 3× speedup over the 16-bit baseline. Code is available at https://github.com/utkarsh-dmx/project-resq.
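To make the quantization scheme concrete, the following hedged sketch combines the pieces the abstract describes: project activations onto the high-variance subspace and its orthogonal complement, apply a random orthogonal rotation within each subspace to flatten outliers, then quantize the subspace coefficients to 8-bit and the remainder to 4-bit. All function names are illustrative assumptions, and the per-tensor symmetric fake-quantizer is a simplification of whatever quantizer the paper actually uses.

```python
# Hedged sketch of the mixed-precision step: 8-bit inside the high-variance
# subspace, 4-bit outside, with a random rotation applied within each
# subspace. Names are illustrative, not the paper's API.
import torch

def random_rotation(n: int, seed: int = 0) -> torch.Tensor:
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix.
    g = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(n, n, generator=g))
    return q

def quantize_sym(x: torch.Tensor, bits: int) -> torch.Tensor:
    # Uniform symmetric fake-quantization (quantize, then dequantize).
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

def mixed_precision_quant(X: torch.Tensor, U_hi: torch.Tensor) -> torch.Tensor:
    """X: activations (num_tokens, d); U_hi: orthonormal basis (d, r)
    of the high-variance subspace, e.g. from the PCA sketch above."""
    d, r = U_hi.shape
    # Complete U_hi to a full orthonormal basis; the trailing columns span
    # the orthogonal complement (the low-precision part).
    U_full, _ = torch.linalg.qr(torch.cat([U_hi, torch.randn(d, d - r)], dim=1))
    U_lo = U_full[:, r:]
    R_hi, R_lo = random_rotation(r), random_rotation(d - r, seed=1)
    hi = quantize_sym(X @ U_hi @ R_hi, bits=8)   # high-fidelity subspace
    lo = quantize_sym(X @ U_lo @ R_lo, bits=4)   # the remaining 7/8
    # Projections and rotations are orthogonal, so they invert by transpose.
    return hi @ R_hi.T @ U_hi.T + lo @ R_lo.T @ U_lo.T
```

Because the rotations are orthogonal, they change neither the subspace split nor the reconstruction, only the coordinate system in which quantization happens, which is what makes them "invariant" in the sense used above.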