LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses 2-bit quantization-aware training (QAT) for large language models, where scalar quantization degrades severely and vector quantization is difficult to optimize end-to-end because of its discrete codebook lookup. The authors propose LC-QAT, a weight-only framework that represents quantized weights through a learned affine mapping over discrete vectors, so the training forward pass is fully differentiable and needs no explicit codebook lookup. The same formulation yields a high-quality post-training quantization (PTQ) initialization, which makes training data-efficient: across multiple large language models, LC-QAT consistently outperforms state-of-the-art QAT methods while using only 0.1%–10% of the training data.

📝 Abstract

Quantization-aware training (QAT) is essential for extremely low-bit large language models (LLMs). Current QAT methods are mainly based on scalar quantization (SQ), which enables efficient optimization but suffers from severe performance degradation at 2-bit precision. On the other hand, vector quantization (VQ) provides substantially higher representational capacity, but its discrete codebook lookup prevents end-to-end training. We propose LC-QAT, a 2-bit weight-only VQ-QAT framework that represents quantized weights via a learned affine mapping over discrete vectors, which yields a high-quality PTQ initialization and enables fully differentiable end-to-end optimization without explicit codebook lookup in the training forward pass. This strong post-training initialization makes LC-QAT highly data-efficient. Experiments across diverse LLMs demonstrate that LC-QAT consistently outperforms state-of-the-art QAT methods while using only 0.1%–10% of the training data. Our results establish LC-QAT as a practical and scalable solution for extreme low-bit model deployment.
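The core idea in the abstract, quantized weights expressed as a learned affine mapping over discrete vectors, can be illustrated with a small sketch. This is not the authors' implementation: the group size `D`, the choice of integer codes in {0, 1, 2, 3}, the uniform-rounding stand-in for the PTQ initialization, the decision to keep the codes fixed while training, and every function name are assumptions made for illustration only. The sketch shows that once each weight group is written as `A @ z + b`, the reconstruction is an ordinary differentiable function of `A` and `b`, so plain gradient descent applies without indexing a codebook table.

```python
import random

D = 4        # weights per group (hypothetical group size, not from the paper)
LEVELS = 4   # 2 bits per weight -> integer codes in {0, 1, 2, 3} (assumed code set)

def reconstruct(z, A, b):
    # Affine map A @ z + b, computed directly from the discrete code vector z.
    return [sum(A[i][j] * z[j] for j in range(D)) + b[i] for i in range(D)]

def mse(codes, targets, A, b):
    # Mean over groups of the squared reconstruction error.
    total = 0.0
    for z, w in zip(codes, targets):
        w_hat = reconstruct(z, A, b)
        total += sum((w_hat[i] - w[i]) ** 2 for i in range(D))
    return total / len(codes)

def grad_step(codes, targets, A, b, lr=0.01):
    # One gradient-descent step on the MSE with respect to A and b.
    # The discrete codes are held fixed (a simplifying assumption).
    n = len(codes)
    gA = [[0.0] * D for _ in range(D)]
    gb = [0.0] * D
    for z, w in zip(codes, targets):
        w_hat = reconstruct(z, A, b)
        for i in range(D):
            r = 2.0 * (w_hat[i] - w[i]) / n
            gb[i] += r
            for j in range(D):
                gA[i][j] += r * z[j]
    new_A = [[A[i][j] - lr * gA[i][j] for j in range(D)] for i in range(D)]
    new_b = [b[i] - lr * gb[i] for i in range(D)]
    return new_A, new_b

random.seed(0)
# Synthetic "full-precision" weight groups, used only as a toy target.
targets = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(256)]

# Stand-in for a PTQ initialization: uniform rounding onto 4 levels.
scale, offset = 1.0, -1.5
codes = [[min(LEVELS - 1, max(0, round((x - offset) / scale))) for x in w]
         for w in targets]
A = [[scale if i == j else 0.0 for j in range(D)] for i in range(D)]
b = [offset] * D

loss_before = mse(codes, targets, A, b)
for _ in range(50):
    A, b = grad_step(codes, targets, A, b)
loss_after = mse(codes, targets, A, b)
print(loss_before, loss_after)
```

In this toy setting the loss is a convex quadratic in `A` and `b`, so with the small step size chosen here the reconstruction error cannot increase from one step to the next. A real QAT setup would instead backpropagate a task loss through the whole model, and the paper's treatment of the discrete vectors during training may differ from the fixed-code assumption made here.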

Problem

Research questions and friction points this paper is trying to address.

Quantization-aware training

2-bit quantization

Large language models

Vector quantization

Data efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vector quantization

Quantization-aware training

Data-efficient