🤖 AI Summary
Existing low-bit (<2-bit) language model quantization methods struggle to balance accuracy and scalability during training from scratch due to the homogenization of parameter sensitivity. This work proposes pQuant, which identifies and addresses the previously overlooked "parameter democratization effect." The method decouples linear layers into a 1-bit backbone branch and a high-precision branch dedicated to sensitive parameters, guided by a feature scaling mechanism that strategically allocates sensitivity. Furthermore, it integrates sparse mixture-of-experts activation structures to enhance model capacity. pQuant substantially outperforms current quantization approaches under ultra-low-bit settings, achieving state-of-the-art accuracy while maintaining efficient inference.
📝 Abstract
Quantization-Aware Training from scratch has emerged as a promising approach for building efficient large language models (LLMs) with extremely low-bit weights (sub-2-bit), which can offer substantial advantages for edge deployment. However, existing methods still fail to achieve satisfactory accuracy and scalability. In this work, we identify a parameter democratization effect as a key bottleneck: the sensitivity of all parameters becomes homogenized, severely limiting expressivity. To address this, we propose pQuant, a method that decouples parameters by splitting linear layers into two specialized branches: a dominant 1-bit branch for efficient computation and a compact high-precision branch dedicated to preserving the most sensitive parameters. Through tailored feature scaling, we explicitly guide the model to allocate sensitive parameters to the high-precision branch. Furthermore, we extend this branch into multiple sparsely activated experts, enabling efficient capacity scaling. Extensive experiments show that pQuant achieves state-of-the-art performance in extremely low-bit quantization.
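The dual-branch idea can be illustrated with a minimal numpy sketch. This is only one possible reading of the abstract, not the paper's actual method: here the 1-bit backbone binarizes weights with a per-output-channel scale, and the compact high-precision branch carries a full-precision residual for a handful of "sensitive" output channels (selected here by largest binarization error; pQuant's sensitivity criterion, feature scaling, and expert routing are not reproduced). All names (`d_hp`, `dual_branch_linear`, etc.) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, d_hp = 8, 8, 2   # d_hp: number of "sensitive" channels kept high-precision (illustrative)

W = rng.standard_normal((d_in, d_out))  # stand-in for a trained linear layer's weight

# 1-bit backbone branch: binarize to {-1, +1} with a per-output-channel scale
scale = np.abs(W).mean(axis=0)          # one scale per output channel
W_bin = np.sign(W) * scale

# Compact high-precision branch: a sparse full-precision residual on the
# channels where binarization hurts most (a proxy for parameter sensitivity)
err = np.abs(W - W_bin).sum(axis=0)
sensitive = np.argsort(err)[-d_hp:]
W_res = np.zeros_like(W)
W_res[:, sensitive] = W[:, sensitive] - W_bin[:, sensitive]

def dual_branch_linear(x):
    # cheap 1-bit matmul plus a small high-precision correction
    return x @ W_bin + x @ W_res
```

On the sensitive channels the two branches sum back to the full-precision result, while the remaining channels pay only the 1-bit cost; extending `W_res` into several sparsely activated experts would then scale capacity without densifying the backbone.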