🤖 AI Summary
This work addresses 2-bit quantization-aware training (QAT) for large language models, where scalar quantization suffers severe performance degradation and vector quantization is difficult to optimize end-to-end. The authors propose LC-QAT, a weight-only framework for differentiable 2-bit vector quantization without explicit codebook lookups. By incorporating linear constraints into the quantization design, representing quantized weights as a learned affine mapping over discrete vectors, LC-QAT keeps the training forward pass fully differentiable and so allows end-to-end training. Combined with a high-quality post-training quantization (PTQ) initialization, this makes LC-QAT highly data-efficient: it consistently outperforms state-of-the-art QAT methods across multiple large language models while using only 0.1%–10% of the training data.
📝 Abstract
Quantization-aware training (QAT) is essential for extremely low-bit large language models (LLMs). Current QAT methods are mainly based on scalar quantization (SQ), which enables efficient optimization but suffers from severe performance degradation at 2-bit precision. Vector quantization (VQ), by contrast, provides substantially higher representational capacity, but its discrete codebook lookup prevents end-to-end training. We propose LC-QAT, a 2-bit weight-only VQ-QAT framework that represents quantized weights via a learned affine mapping over discrete vectors. This representation yields a high-quality post-training quantization (PTQ) initialization and enables fully differentiable end-to-end optimization without explicit codebook lookup in the training forward pass. The strong post-training initialization makes LC-QAT highly data-efficient. Experiments across diverse LLMs demonstrate that LC-QAT consistently outperforms state-of-the-art QAT methods while using only 0.1%–10% of the training data. Our results establish LC-QAT as a practical and scalable solution for extreme low-bit model deployment.
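To make the idea of "an affine mapping over discrete vectors" concrete, the following NumPy sketch reconstructs weight vectors as `codes @ A.T + b`, where `codes` holds fixed 2-bit integers and `A`, `b` are continuous parameters. It is an illustration under assumptions, not the LC-QAT implementation: the group size `d`, the code alphabet {0, 1, 2, 3}, the squared-error objective, and every function name (`dequantize`, `affine_grads`, `fit_affine`) are hypothetical, and the least-squares fit only stands in for whatever PTQ initialization the authors actually use.

```python
import numpy as np

def dequantize(codes, A, b):
    """Map discrete code vectors to real-valued weight vectors.

    codes: (n, d) integers in {0, 1, 2, 3}, i.e. 2 bits per weight (assumed layout)
    A:     (d, d) matrix, b: (d,) offset; both are continuous parameters
    """
    return codes @ A.T + b

def loss(codes, A, b, target):
    """Half the squared reconstruction error against target weights."""
    err = dequantize(codes, A, b) - target
    return 0.5 * np.sum(err ** 2)

def affine_grads(codes, A, b, target):
    """Exact gradients of `loss` w.r.t. A and b, with the codes held fixed.

    The forward pass is a plain matrix product, so no table lookup or
    straight-through estimator is needed in this toy setting.
    """
    err = dequantize(codes, A, b) - target
    return err.T @ codes, err.sum(axis=0)

def fit_affine(codes, target):
    """Closed-form least-squares fit of (A, b) for fixed codes.

    Used here only as a stand-in for an initialization step.
    """
    X = np.hstack([codes.astype(float), np.ones((codes.shape[0], 1))])
    sol, *_ = np.linalg.lstsq(X, target, rcond=None)
    return sol[:-1].T, sol[-1]

# Toy usage: synthetic weights that are exactly affine in the codes.
rng = np.random.default_rng(0)
n, d = 64, 4
codes = rng.integers(0, 4, size=(n, d))
A_true = rng.normal(size=(d, d))
b_true = rng.normal(size=d)
target = dequantize(codes, A_true, b_true)
A_fit, b_fit = fit_affine(codes, target)
```

In this sketch the only stored discrete data are the 2-bit codes, while `A` and `b` can be updated by ordinary gradient descent using `affine_grads`. How LC-QAT selects or updates the discrete vectors themselves is not described in the abstract and is left out here.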