AI Summary
To enable efficient deployment of large language models (LLMs) on resource-constrained devices, this paper proposes a backpropagation-free post-training quantization method that explicitly models inter-layer dependencies within the attention mechanism to enhance low-bit quantization performance. The method introduces three key innovations: (1) an attention-aware Hessian approximation that explicitly captures cross-layer interactions; (2) gradient-free, layer-wise collaborative optimization of integer-valued weights, replacing conventional rounding schemes with constrained integer programming under weight-only quantization; and (3) a co-suppression mechanism for activation outliers that jointly mitigates outlier amplification across layers. Evaluated on multiple mainstream LLMs, the approach achieves state-of-the-art weight-only quantization accuracy. When combined with activation quantization, it significantly improves inference accuracy under extremely low-bit configurations such as W4A4, outperforming prior methods in both perplexity and downstream task performance.
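To make the second innovation concrete, the following is a minimal sketch of Hessian-aware, gradient-free integer weight refinement. It uses the common layer-wise proxy Hessian H = X Xᵀ built from calibration activations and a greedy ±1 coordinate search as a stand-in for the paper's constrained integer-programming step; the sizes, the symmetric 4-bit grid, and the greedy search are all illustrative assumptions, not the paper's actual algorithm (which additionally couples layers through attention-aware Hessians).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy calibration data: d input features, n tokens (hypothetical sizes).
d, n = 8, 64
X = rng.normal(size=(d, n))
H = X @ X.T / n            # layer-wise proxy Hessian, H ~ E[x x^T]
w = rng.normal(size=d)     # one weight row of a linear layer

# Symmetric 4-bit grid: integers in [-8, 7] times a per-row scale.
s = np.abs(w).max() / 7

def hessian_loss(q):
    """Quadratic proxy loss (w - s*q)^T H (w - s*q)."""
    e = w - s * q
    return float(e @ H @ e)

# Baseline: nearest rounding, which ignores cross-weight terms in H.
q_nearest = np.clip(np.round(w / s), -8, 7)

# Gradient-free refinement: accept +/-1 integer moves only when they
# lower the Hessian-weighted loss, until no move helps.
q = q_nearest.copy()
improved = True
while improved:
    improved = False
    for i in range(d):
        for delta in (-1.0, 1.0):
            cand = q.copy()
            cand[i] = np.clip(cand[i] + delta, -8, 7)
            if hessian_loss(cand) < hessian_loss(q):
                q, improved = cand, True

print(hessian_loss(q_nearest), hessian_loss(q))
```

The refined assignment can only match or beat nearest rounding under the proxy loss, which is the basic motivation for optimizing integer weights instead of rounding them independently.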
Abstract
Post-training quantization (PTQ) is a promising solution for deploying large language models (LLMs) on resource-constrained devices. Early methods developed for smaller networks like ResNet rely on gradient-based optimization, which becomes impractical for hyper-scale LLMs with billions of parameters. While recently proposed backpropagation-free or transformation-based methods alleviate this issue, their performance remains limited either by ignoring inter-layer dependencies or by falling back on naive nearest-rounding integer weight assignment to avoid the heavy computational cost of weight optimization. We thus introduce a novel backpropagation-free PTQ algorithm that optimizes integer weights while accounting for inter-layer dependencies. The key innovation is the development of attention-aware Hessian matrices that capture inter-layer interactions within the attention module. Extensive experiments demonstrate that our approach not only outperforms existing weight quantization methods but also combines well with conventional activation-outlier suppression techniques, leading to state-of-the-art weight-activation quantization performance.
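The "conventional methods to suppress activation outliers" mentioned above typically migrate outlier magnitude from activations into weights via a per-channel rescaling (as in SmoothQuant-style smoothing). Below is a minimal sketch of that idea under assumed toy shapes and the commonly used migration strength α = 0.5; it is an illustration of the generic technique, not this paper's co-suppression mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy activations with one outlier channel (hypothetical shapes).
n, d = 32, 8
X = rng.normal(size=(n, d))
X[:, 0] *= 20.0            # channel 0 carries large outliers
W = rng.normal(size=(d, 4))

# SmoothQuant-style migration: divide each activation channel by a
# scale s_j and fold s_j into the next layer's weights, so the
# product X @ W is mathematically unchanged.
alpha = 0.5
a_max = np.abs(X).max(axis=0)           # per-channel activation range
w_max = np.abs(W).max(axis=1)           # per-channel weight range
s = a_max**alpha / w_max**(1 - alpha)

X_s = X / s                # smoothed activations, smaller dynamic range
W_s = W * s[:, None]       # compensated weights

print(np.abs(X).max(), np.abs(X_s).max())
```

Shrinking the activation dynamic range this way is what makes low-bit activation quantization (e.g. the A4 in W4A4) viable, and it is orthogonal to how the integer weights themselves are optimized.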