🤖 AI Summary
Existing low-bit (<2-bit) language model quantization methods struggle to balance accuracy and scalability during training from scratch due to the homogenization of parameter sensitivity. This work proposes pQuant, which identifies and addresses the previously overlooked "parameter democratization effect." The method decouples linear layers into a 1-bit backbone branch and a high-precision branch dedicated to sensitive parameters, guided by a feature scaling mechanism that strategically allocates sensitivity. Furthermore, it integrates sparse mixture-of-experts activation structures to enhance model capacity. pQuant substantially outperforms current quantization approaches under ultra-low-bit settings, achieving state-of-the-art accuracy while maintaining efficient inference.
📝 Abstract
Quantization-Aware Training from scratch has emerged as a promising approach for building efficient large language models (LLMs) with extremely low-bit weights (sub-2-bit), which can offer substantial advantages for edge deployment. However, existing methods still fail to achieve satisfactory accuracy and scalability. In this work, we identify a parameter democratization effect as a key bottleneck: the sensitivity of all parameters becomes homogenized, severely limiting expressivity. To address this, we propose pQuant, a method that decouples parameters by splitting linear layers into two specialized branches: a dominant 1-bit branch for efficient computation and a compact high-precision branch dedicated to preserving the most sensitive parameters. Through tailored feature scaling, we explicitly guide the model to allocate sensitive parameters to the high-precision branch. Furthermore, we extend this branch into multiple sparsely activated experts, enabling efficient capacity scaling. Extensive experiments show that pQuant achieves state-of-the-art performance in extremely low-bit quantization.
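The dual-branch idea can be illustrated with a minimal numpy sketch. This is only one possible reading of the abstract, not the paper's actual method: here the 1-bit backbone binarizes weights with a per-output-channel scale, and the compact high-precision branch carries a full-precision residual for a handful of "sensitive" output channels (selected here by largest binarization error; pQuant's sensitivity criterion, feature scaling, and expert routing are not reproduced). All names (`d_hp`, `dual_branch_linear`, etc.) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, d_hp = 8, 8, 2   # d_hp: number of "sensitive" channels kept high-precision (illustrative)

W = rng.standard_normal((d_in, d_out))  # stand-in for a trained linear layer's weight

# 1-bit backbone branch: binarize to {-1, +1} with a per-output-channel scale
scale = np.abs(W).mean(axis=0)          # one scale per output channel
W_bin = np.sign(W) * scale

# Compact high-precision branch: a sparse full-precision residual on the
# channels where binarization hurts most (a proxy for parameter sensitivity)
err = np.abs(W - W_bin).sum(axis=0)
sensitive = np.argsort(err)[-d_hp:]
W_res = np.zeros_like(W)
W_res[:, sensitive] = W[:, sensitive] - W_bin[:, sensitive]

def dual_branch_linear(x):
    # cheap 1-bit matmul plus a small high-precision correction
    return x @ W_bin + x @ W_res
```

On the sensitive channels the two branches sum back to the full-precision result, while the remaining channels pay only the 1-bit cost; extending `W_res` into several sparsely activated experts would then scale capacity without densifying the backbone.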