OAC: Output-adaptive Calibration for Accurate Post-training Quantization

📅 2024-05-23
🏛️ AAAI Conference on Artificial Intelligence
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the severe accuracy degradation in low-bit post-training quantization (PTQ) of large language models (LLMs) caused by neglecting output-level effects, this paper proposes an output-adaptive calibration framework. Methodologically, it departs from conventional layer-wise Euclidean distance minimization by modeling the quantization error directly as distortion of the output cross-entropy loss. It introduces an efficient approximation of the output-adaptive Hessian matrix, balancing accuracy against computational cost, and uses this Hessian both to identify the most salient weights and to guide calibration. Empirically, the method achieves state-of-the-art performance under extreme quantization regimes (2-bit and binary), significantly outperforming SpQR and BiLLM without fine-tuning. It preserves high inference accuracy while enabling deployment on resource-constrained edge devices.

📝 Abstract
Deployment of Large Language Models (LLMs) incurs major computational costs due to their rapidly expanding size. Compression of LLMs reduces the memory footprint, latency, and energy required for their inference. Post-training Quantization (PTQ) techniques have been developed to compress LLMs while avoiding expensive re-training. Most PTQ approaches formulate the quantization error based on a layer-wise Euclidean loss, ignoring the model output. Each layer is then calibrated using its layer-wise Hessian to update the weights towards minimizing the quantization error. The Hessian is also used for detecting the weights most salient to quantization. Such PTQ approaches are prone to accuracy drops in low-precision quantization. We propose Output-adaptive Calibration (OAC) to incorporate the model output in the calibration process. We formulate the quantization error based on the distortion of the output cross-entropy loss. OAC approximates the output-adaptive Hessian for each layer under reasonable assumptions to reduce the computational complexity. The output-adaptive Hessians are used to update the weight matrices and detect the salient weights towards maintaining the model output. Our proposed method outperforms state-of-the-art baselines such as SpQR and BiLLM, especially at extreme low-precision (2-bit and binary) quantization.
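To make the contrast in the abstract concrete, the sketch below compares a conventional layer-wise Hessian (built only from calibration inputs, as in GPTQ-style methods) with a sensitivity-weighted variant in the spirit of output-adaptive calibration. This is a toy illustration, not the paper's exact derivation: the per-sample cross-entropy sensitivities `g` are stand-ins that a real implementation would obtain by backpropagating the output loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy calibration set: n samples of d-dimensional inputs to one layer.
n, d = 256, 16
X = rng.normal(size=(n, d))

# Conventional layer-wise PTQ: the Hessian of the Euclidean reconstruction
# error ||XW - XW_q||^2 w.r.t. a weight row depends only on the inputs,
# H = (2/n) * X^T X, with no information about the model output.
H_layer = 2.0 / n * X.T @ X

# Output-adaptive idea (hedged sketch): weight each calibration sample by
# how strongly the final cross-entropy loss reacts to that sample's layer
# output. `g` below is a hypothetical placeholder for those per-sample
# gradient magnitudes.
g = rng.random(n)                      # stand-in CE sensitivities per sample
w = g / g.sum()
H_out = 2.0 * (X * w[:, None]).T @ X   # sensitivity-weighted second moment

# Both matrices are symmetric PSD and can drive the same Hessian-based
# calibration and saliency machinery; only the weighting differs.
print(np.allclose(H_layer, H_layer.T), np.allclose(H_out, H_out.T))
```

The design point is that the output-adaptive Hessian reuses the existing layer-wise calibration pipeline; only the statistic fed into it changes.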
Problem

Research questions and friction points this paper is trying to address.

Reducing the computational cost of deploying Large Language Models
Improving accuracy in low-precision Post-training Quantization
Incorporating model output in quantization error calibration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Output-adaptive Calibration for PTQ accuracy
Formulates error via output cross-entropy distortion
Approximates output-adaptive Hessians efficiently
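The last bullet feeds into salient-weight detection. A common OBQ/GPTQ-family saliency score, also used by baselines such as SpQR, is w_i^2 / [H^{-1}]_{ii}; OAC's contribution is to plug the output-adaptive Hessian into this kind of criterion. The sketch below illustrates the scoring step with a random stand-in Hessian (our assumption, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16

# Stand-in for a layer's (output-adaptive) Hessian: SPD plus damping.
A = rng.normal(size=(2 * d, d))
H = A.T @ A / (2 * d) + 1e-2 * np.eye(d)

w_row = rng.normal(size=d)             # one row of the weight matrix

# OBQ-style saliency: the loss increase from quantizing weight i is
# proportional to w_i^2 / [H^{-1}]_{ii}. The highest-scoring weights are
# kept in higher precision (SpQR-style outlier handling).
H_inv = np.linalg.inv(H)
saliency = w_row ** 2 / np.diag(H_inv)

keep_frac = 0.1                        # e.g. keep 10% of weights unquantized
n_keep = max(1, int(keep_frac * d))
salient_idx = np.argsort(saliency)[-n_keep:]
print(salient_idx)
```

Because H is positive definite, the diagonal of H^{-1} is positive, so the scores are well defined for every weight.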