pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training

📅 2026-02-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing sub-2-bit language model quantization methods struggle to balance accuracy and scalability when training from scratch because parameter sensitivity becomes homogenized. This work proposes pQuant, which identifies and addresses this previously overlooked "parameter democratization effect." The method decouples each linear layer into a 1-bit backbone branch and a compact high-precision branch reserved for the most sensitive parameters, with a feature scaling mechanism explicitly steering those parameters into the high-precision branch. It further extends the high-precision branch into sparsely activated mixture-of-experts structures to scale model capacity. pQuant substantially outperforms existing quantization approaches in ultra-low-bit settings, achieving state-of-the-art accuracy while retaining efficient inference.
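The decoupling idea is concrete enough to sketch. Below is a minimal, hypothetical PyTorch rendering of such a two-branch linear layer, assuming the 1-bit branch binarizes weights with a per-channel scale and a straight-through estimator, and that the high-precision branch is a compact low-rank path amplified by a feature-scaling factor. The class names, `hp_dim`, and `alpha` are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a pQuant-style decoupled linear layer (hypothetical
# names and hyperparameters; not the paper's released implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryLinear(nn.Module):
    """1-bit branch: sign(W) with a per-output-channel scale, trained
    with a straight-through estimator (STE) so gradients reach the
    latent full-precision weights during QAT."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight)

    def forward(self, x):
        w = self.weight
        scale = w.abs().mean(dim=1, keepdim=True)   # per-channel scale
        w_bin = torch.sign(w) * scale               # 1-bit weights + scale
        w_q = w + (w_bin - w).detach()              # STE: binarized forward, identity backward
        return F.linear(x, w_q)

class DecoupledLinear(nn.Module):
    """Dominant 1-bit branch plus a compact high-precision branch.
    `hp_dim` is an assumed small bottleneck width; `alpha` stands in
    for the paper's feature-scaling mechanism, amplifying the
    high-precision path so training routes sensitive parameters there."""
    def __init__(self, in_features, out_features, hp_dim=64, alpha=4.0):
        super().__init__()
        self.bit1 = BinaryLinear(in_features, out_features)
        self.hp_down = nn.Linear(in_features, hp_dim, bias=False)  # high-precision, low-rank
        self.hp_up = nn.Linear(hp_dim, out_features, bias=False)
        self.alpha = alpha

    def forward(self, x):
        return self.bit1(x) + self.alpha * self.hp_up(self.hp_down(x))

# Drop-in replacement for nn.Linear inside a transformer block:
layer = DecoupledLinear(1024, 4096)
print(layer(torch.randn(2, 16, 1024)).shape)  # torch.Size([2, 16, 4096])
```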

📝 Abstract
Quantization-Aware Training from scratch has emerged as a promising approach for building efficient large language models (LLMs) with extremely low-bit weights (sub-2-bit), which can offer substantial advantages for edge deployment. However, existing methods still fail to achieve satisfactory accuracy and scalability. In this work, we identify a parameter democratization effect as a key bottleneck: the sensitivity of all parameters becomes homogenized, severely limiting expressivity. To address this, we propose pQuant, a method that decouples parameters by splitting linear layers into two specialized branches: a dominant 1-bit branch for efficient computation and a compact high-precision branch dedicated to preserving the most sensitive parameters. Through tailored feature scaling, we explicitly guide the model to allocate sensitive parameters to the high-precision branch. Furthermore, we extend this branch into multiple sparsely activated experts, enabling efficient capacity scaling. Extensive experiments indicate that pQuant achieves state-of-the-art performance in extremely low-bit quantization.
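The abstract's final step, extending the high-precision branch into sparsely activated experts, can likewise be sketched. The fragment below is a hypothetical top-1-routed variant in PyTorch; `SparseHighPrecisionBranch`, `num_experts`, `hp_dim`, and the softmax-over-top-k router are assumptions standing in for whatever routing scheme the paper actually uses.

```python
# Hypothetical sketch of the sparsely-activated extension: the single
# high-precision branch becomes several small experts, and each token
# is routed to only its top-k (here top-1) experts.
import torch
import torch.nn as nn

class SparseHighPrecisionBranch(nn.Module):
    def __init__(self, in_features, out_features, num_experts=4, hp_dim=64, k=1):
        super().__init__()
        self.router = nn.Linear(in_features, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(in_features, hp_dim, bias=False),
                nn.Linear(hp_dim, out_features, bias=False),
            )
            for _ in range(num_experts)
        )
        self.k = k
        self.out_features = out_features

    def forward(self, x):
        logits = self.router(x)                     # (..., num_experts)
        weights, idx = logits.topk(self.k, dim=-1)  # top-k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros(*x.shape[:-1], self.out_features,
                          device=x.device, dtype=x.dtype)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e          # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Added to the 1-bit backbone's output in place of the single
# high-precision path, so capacity grows with num_experts while
# per-token compute stays near the top-k cost:
branch = SparseHighPrecisionBranch(1024, 4096)
print(branch(torch.randn(2, 16, 1024)).shape)  # torch.Size([2, 16, 4096])
```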
Problem

Research questions and friction points this paper is trying to address.

low-bit quantization
quantization-aware training
parameter democratization
large language models
model expressivity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled Quantization
Low-Bit LLMs
Quantization-Aware Training
Parameter Sensitivity
Sparse Experts
Authors

Wenzheng Zhang
Rutgers University
Natural Language Processing, Deep Learning

Bingzheng Liu
College of Future Information Technology, Fudan University

Yang Hu
Center for Information Research, Academy of Military Sciences

Xiaoying Bai
Tsinghua University
Software engineering, software testing, service-oriented computing, cloud computing

Wentao Zhang
Institute of Physics, Chinese Academy of Sciences
Photoemission, superconductivity, cuprate, HTSC, time-resolved

Bin Cui
School of Computer Science, Peking University