RILQ: Rank-Insensitive LoRA-based Quantization Error Compensation for Boosting 2-bit Large Language Model Accuracy

📅 2024-12-02
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LoRA-based Quantization Error Compensation (LQEC) fails under 2-bit large language model quantization due to its sensitivity to adapter rank. Method: The authors propose RILQ (Rank-Insensitive LoRA-based Quantization Error Compensation), the first method to identify and exploit the rank-insensitive nature of the model-wise activation discrepancy loss, yielding a robust loss function grounded in rank analysis; use this loss to adjust LoRA adapters cooperatively across layers, overcoming LQEC's performance ceiling below 4 bits; and enable efficient inference by merging adapters into the quantized weights. Contribution/Results: Evaluated on LLaMA-2 and LLaMA-3, RILQ achieves significant accuracy gains under 2-bit quantization across mainstream state-of-the-art quantizers, attains strong performance on task-specific fine-tuning, and incurs computational overhead comparable to standard LoRA.

📝 Abstract
Low-rank adaptation (LoRA) has become the dominant method for parameter-efficient LLM fine-tuning, with LoRA-based quantization error compensation (LQEC) emerging as a powerful tool for recovering accuracy in compressed LLMs. However, LQEC has underperformed in sub-4-bit scenarios, with no prior investigation into understanding this limitation. We propose RILQ (Rank-Insensitive LoRA-based Quantization Error Compensation) to understand this fundamental limitation and boost 2-bit LLM accuracy. Based on a rank analysis revealing the rank-insensitive nature of the model-wise activation discrepancy loss, RILQ employs this loss to adjust adapters cooperatively across layers, enabling robust error compensation with low-rank adapters. Evaluations on LLaMA-2 and LLaMA-3 demonstrate RILQ's consistent improvements in 2-bit quantized inference across various state-of-the-art quantizers and enhanced accuracy in task-specific fine-tuning. RILQ maintains computational efficiency comparable to existing LoRA methods, enabling adapter-merged weight-quantized LLM inference with significantly enhanced accuracy, making it a promising approach for boosting 2-bit LLM performance. Our code is available at https://github.com/aiha-lab/RILQ.
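The core idea in the abstract — tuning low-rank adapters against a model-wise activation discrepancy loss rather than per-layer weight error, then merging the adapters into the quantized weights for inference — can be illustrated with a toy NumPy sketch. This is not the paper's implementation: the 2-bit quantizer, layer sizes, rank, learning rate, and the two-linear-layer "model" are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 32, 64, 4  # calibration batch, hidden size, LoRA rank (hypothetical)

def quantize_2bit(w):
    """Naive symmetric per-row 2-bit quantizer (illustration only,
    not the paper's quantizer): 4 levels {-2, -1, 0, 1} times a scale."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 1.5
    return np.clip(np.round(w / scale), -2, 1) * scale

# Two full-precision "layers" and their 2-bit counterparts.
W1 = rng.normal(size=(d, d)) / np.sqrt(d)
W2 = rng.normal(size=(d, d)) / np.sqrt(d)
Q1, Q2 = quantize_2bit(W1), quantize_2bit(W2)

# One LoRA adapter (B @ A) per layer, standard init: A random, B zero.
A1, B1 = rng.normal(size=(r, d)) / np.sqrt(d), np.zeros((d, r))
A2, B2 = rng.normal(size=(r, d)) / np.sqrt(d), np.zeros((d, r))

X = rng.normal(size=(n, d))  # calibration activations
Y_fp = X @ W1 @ W2           # full-precision model output

def forward():
    M1, M2 = Q1 + B1 @ A1, Q2 + B2 @ A2
    H = X @ M1
    return M1, M2, H, H @ M2

_, _, _, Y0 = forward()
loss0 = np.mean((Y0 - Y_fp) ** 2)  # model-wise activation discrepancy

lr = 0.5
for _ in range(300):
    M1, M2, H, Y = forward()
    dY = 2.0 * (Y - Y_fp) / Y.size   # gradient of the end-to-end loss
    dM2 = H.T @ dY                   # backprop through layer 2
    dM1 = X.T @ (dY @ M2.T)          # backprop through layer 1
    # Both layers' adapters are updated against the SAME model-wise
    # loss, so they compensate the quantization error cooperatively.
    gB1, gA1 = dM1 @ A1.T, B1.T @ dM1
    gB2, gA2 = dM2 @ A2.T, B2.T @ dM2
    B1 -= lr * gB1; A1 -= lr * gA1
    B2 -= lr * gB2; A2 -= lr * gA2

_, _, _, Y = forward()
loss1 = np.mean((Y - Y_fp) ** 2)
print(f"activation discrepancy loss: {loss0:.5f} -> {loss1:.5f}")

# Adapter merging for inference: fold B @ A into the quantized weight,
# so deployment is a plain dense matmul with no extra adapter path.
W1_merged, W2_merged = Q1 + B1 @ A1, Q2 + B2 @ A2
assert np.allclose(X @ W1_merged @ W2_merged, Y)
```

The sketch shows why the loss is "model-wise": each adapter's gradient flows through the other layer's weights, unlike per-layer weight-error objectives that optimize every adapter in isolation.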
Problem

Research questions and friction points this paper is trying to address.

Improves 2-bit LLM accuracy via rank-insensitive error compensation
Addresses underperformance of LQEC in sub-4-bit quantization scenarios
Enables efficient adapter-merged inference for weight-quantized LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rank-insensitive LoRA for 2-bit LLMs
Layer-cooperative adapter adjustment
Efficient low-rank error compensation
Geonho Lee
Hanyang University
Janghwan Lee
Hanyang University
S. Hong
KT
Minsoo Kim
Hanyang University
Euijai Ahn
KT
Duhyeuk Chang
KT
Jungwook Choi
Hanyang University
Deep Neural Network · Quantization · Large Language Model · Efficient AI · AI Accelerator