🤖 AI Summary
Low-bit quantization of Vision Transformers (ViTs) faces challenges including severe accuracy degradation under post-training quantization (PTQ), high computational overhead and poor generalizability in quantization-aware training (QAT), and training instability. This paper proposes GPLQ, a lightweight 4-bit quantization framework featuring a novel "activation-first, weight-later" two-stage strategy: in Stage I, FP32 weights are frozen while activation quantization is optimized in a single pass, preserving the optimization basin geometry; in Stage II, weights are quantized via efficient PTQ-style calibration. A feature imitation loss is introduced to enhance transfer robustness across tasks. We release an open-source, multi-task-compatible toolchain. Compared to state-of-the-art QAT methods, GPLQ accelerates training by 100× and reduces GPU memory consumption below that of FP32 training. On ImageNet, fine-grained classification, and object detection, 4-bit ViTs quantized by GPLQ achieve performance nearly on par with their FP32 counterparts.
📄 Abstract
Vision Transformers (ViTs) are essential in computer vision but are computationally intensive. Model quantization, particularly to low bit-widths such as 4-bit, aims to alleviate this burden, yet existing Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) methods exhibit significant limitations. PTQ often incurs substantial accuracy drops, while QAT achieves high accuracy but suffers from prohibitive computational costs, limited generalization to downstream tasks, training instability, and a lack of open-source codebases. To address these challenges, this paper introduces General, Practical, and Lightning Quantization (GPLQ), a novel framework designed for efficient and effective ViT quantization. GPLQ is founded on two key empirical insights: the paramount importance of activation quantization and the necessity of preserving the model's original optimization "basin" to maintain generalization. Consequently, GPLQ employs a sequential "activation-first, weights-later" strategy. Stage 1 keeps weights in FP32 while quantizing activations with a feature-mimicking loss in only one epoch, keeping the model in its original "basin" and thereby preserving generalization. Stage 2 quantizes weights using a PTQ method. As a result, GPLQ is 100x faster than existing QAT methods, lowers the memory footprint to levels below even that of FP32 training, and achieves 4-bit model performance that is highly competitive with FP32 models, both in accuracy on ImageNet and in generalization to diverse downstream tasks, including fine-grained visual classification and object detection. We will release an easy-to-use open-source toolkit supporting multiple vision tasks.
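To make the two-stage idea concrete, here is a minimal, hedged sketch of the "activation-first, weights-later" workflow on a toy linear layer. All function names, the uniform symmetric fake-quantizer, and the min-max PTQ rule below are illustrative assumptions, not the paper's actual implementation; they only show how Stage 1 (FP32 weights, fake-quantized activations, feature-imitation loss against FP32 features) differs from Stage 2 (PTQ-style weight quantization applied afterwards).

```python
# Illustrative sketch of GPLQ's two stages on a toy linear layer.
# The quantizer and loss below are simplified placeholders, not the
# paper's implementation.

def fake_quant(x, bits=4):
    """Uniform symmetric fake quantization: quantize, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                     # e.g. 7 for signed 4-bit
    scale = (max(abs(v) for v in x) / qmax) or 1.0  # min-max calibration
    return [round(v / scale) * scale for v in x]

def linear(w, x):
    """Plain FP32 matrix-vector product (stand-in for a ViT layer)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def feature_imitation_loss(student, teacher):
    """MSE between quantized-activation features and FP32 features."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

# Toy FP32 weights and input.
w = [[0.5, -1.2], [0.3, 0.8]]
x = [1.0, -2.0]

# Stage 1: weights stay FP32; only activations are fake-quantized.
# The imitation loss would drive the activation quantizer to keep the
# model's features (and hence its optimization "basin") close to FP32.
fp32_feats = linear(w, x)
stage1_feats = linear(w, fake_quant(x))
loss = feature_imitation_loss(stage1_feats, fp32_feats)

# Stage 2: weights are quantized afterwards with a PTQ-style rule,
# on top of the already-calibrated activation quantizer.
w_q = [fake_quant(row) for row in w]
stage2_feats = linear(w_q, fake_quant(x))
```

In a real training loop the Stage 1 loss would be minimized over the activation quantizer's parameters for a single epoch; this sketch just evaluates it once to show where each quantizer sits.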