🤖 AI Summary
Extreme 2-bit quantization severely degrades the inference accuracy of large language models, hindering their deployment on edge devices. This work proposes a mixed-precision quantization strategy, termed W4/W2-GateUp, which applies 2-bit quantization exclusively to the gate and up projection layers within the MLP blocks and keeps the remaining linear layers at 4-bit. To recover the lost accuracy, it extends Recover-LoRA, a data-free method that trains low-rank adapters (LoRA) on the quantized layers via logit-based knowledge distillation on synthetic data. Evaluated on Qwen3-4B with only 10,000 synthetic samples and no labeled data, the method achieves 80–95% accuracy recovery on nine out of twelve benchmarks. Roofline analysis across three model families and two hardware platforms further indicates throughput improvements of 7.5–23.3% over uniform 4-bit quantization, depending on model and context length.
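As a rough intuition for where the throughput gain comes from, the sketch below counts the linear-layer weight bytes read per decoded token under uniform W4 and under W4/W2-GateUp. It is an idealized, bandwidth-bound estimate and not the roofline analysis from the paper: the function names are hypothetical, attention projections are assumed square (no grouped-query attention), and KV-cache traffic, embeddings, and quantization scales are ignored, so the ratio it returns is an optimistic upper bound.

```python
def weight_bytes(hidden, intermediate, n_layers, gate_up_bits, other_bits):
    """Bytes of linear-layer weights read for one decoded token (assumed model shape)."""
    # Attention projections q, k, v, o: assumed square (hidden x hidden).
    attn_params = 4 * hidden * hidden
    # Gated MLP: gate and up map hidden -> intermediate, down maps back.
    gate_up_params = 2 * hidden * intermediate
    down_params = hidden * intermediate
    bits = (attn_params + down_params) * other_bits + gate_up_params * gate_up_bits
    return n_layers * bits / 8


def relative_tps(hidden, intermediate, n_layers):
    """Idealized decode-throughput ratio of W4/W2-GateUp over uniform W4."""
    w4 = weight_bytes(hidden, intermediate, n_layers, gate_up_bits=4, other_bits=4)
    mixed = weight_bytes(hidden, intermediate, n_layers, gate_up_bits=2, other_bits=4)
    # When decoding is bandwidth-bound, tokens/s scales inversely with bytes moved.
    return w4 / mixed
```

Calling `relative_tps` with illustrative dimensions shows how the share of gate/up weights sets the ceiling on the gain; the traffic that this estimate ignores is one reason real improvements would be expected to fall below it.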
📝 Abstract
Aggressive weight quantization to 2-bit precision offers substantial throughput and memory gains for large language model (LLM) inference, but typically incurs severe accuracy degradation. These gains are particularly relevant for edge and on-device deployment, where memory capacity and bandwidth are primary constraints. In this work, we extend Recover-LoRA, a lightweight, data-free accuracy recovery method originally developed for general model weight corruption, to the setting of ultra-low-bit quantization. We propose a selective mixed-precision strategy in which only the gate and up projection layers of the MLP are quantized to 2-bit (W2), while all other linear layers remain at higher precision, yielding a mixed-precision GateUp configuration. We demonstrate via roofline analysis across three model families (4B–20B) and two hardware platforms that a W4/W2-GateUp deployment (4-bit base with 2-bit gate/up) delivers a 7.5–23.3% improvement in tokens per second (TPS) over uniform W4, depending on model and context length, while confining quantization error to a predictable subset of layers. We then apply Recover-LoRA, which trains low-rank adapters on the quantized layers via logit distillation with synthetic data, to recover the accuracy lost from 2-bit quantization of the gate and up layers. In a case study on Qwen3-4B, Recover-LoRA achieves 80–95% accuracy recovery on 9 of 12 benchmarks, using only 10k synthetic training samples and no labeled data. We further demonstrate that synthetic data performs comparably to curated labeled data for distillation-based recovery, and that recovery generalizes to out-of-distribution evaluation tasks. Our results present Recover-LoRA as a practical post-quantization accuracy recovery tool for aggressive weight compression in deployment settings.
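The three ingredients named in the abstract (2-bit weights, a low-rank adapter on the quantized layer, and a logit-distillation objective) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the symmetric round-to-nearest quantizer, the KL-divergence form of the loss, the temperature parameter, and all function names are assumptions chosen for clarity.

```python
import math


def fake_quantize_row(row, bits=2):
    """Symmetric round-to-nearest quantization of one weight row, returned dequantized."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # -2 and 1 for 2-bit
    scale = max(abs(v) for v in row) / max(qmax, 1) or 1.0  # guard all-zero rows
    return [min(max(round(v / scale), qmin), qmax) * scale for v in row]


def lora_linear(x, w_q, lora_a, lora_b):
    """y = (W_q + B A) x for one input vector; only A and B would be trained."""
    # lora_a has shape (rank, in_features); lora_b has shape (out_features, rank).
    ax = [sum(a * xi for a, xi in zip(a_row, x)) for a_row in lora_a]
    return [
        sum(w * xi for w, xi in zip(w_row, x)) + sum(b * h for b, h in zip(b_row, ax))
        for w_row, b_row in zip(w_q, lora_b)
    ]


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one vector of logits."""
    z = [v / temperature for v in logits]
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for one token position (assumed distillation loss)."""
    t = log_softmax(teacher_logits, temperature)
    s = log_softmax(student_logits, temperature)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))
```

In this sketch the full-precision model would supply `teacher_logits` and the quantized model with adapters would supply `student_logits`, both computed on the same synthetic inputs, so no labels enter the loss. Initializing `lora_b` to zeros makes the adapted layer start out identical to the quantized layer.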