🤖 AI Summary
Low-bit quantization of Vision Transformers (ViTs) faces challenges including severe accuracy degradation under post-training quantization (PTQ), high computational overhead and poor generalizability in quantization-aware training (QAT), and training instability. This paper proposes GPLQ, a lightweight 4-bit quantization framework featuring a novel "activation-first, weight-later" two-stage strategy: in Stage I, FP32 weights are frozen while activation quantization is optimized in a single pass, preserving the optimization basin geometry; in Stage II, weights are quantized via efficient PTQ-style calibration. A feature imitation loss is introduced to enhance transfer robustness across tasks. We release an open-source, multi-task-compatible toolchain. Compared to state-of-the-art QAT methods, GPLQ accelerates training by 100× and reduces GPU memory consumption below that of FP32 training. On ImageNet, fine-grained classification, and object detection, 4-bit ViTs quantized by GPLQ achieve performance nearly on par with their FP32 counterparts.
📄 Abstract
Vision Transformers (ViTs) are essential in computer vision but are computationally intensive. Model quantization, particularly to low bit-widths such as 4-bit, aims to alleviate this burden, yet existing Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) methods exhibit significant limitations. PTQ often incurs substantial accuracy drops, while QAT achieves high accuracy but suffers from prohibitive computational costs, limited generalization to downstream tasks, training instability, and a lack of open-source codebases. To address these challenges, this paper introduces General, Practical, and Lightning Quantization (GPLQ), a novel framework designed for efficient and effective ViT quantization. GPLQ is founded on two key empirical insights: the paramount importance of activation quantization and the necessity of preserving the model's original optimization "basin" to maintain generalization. Consequently, GPLQ employs a sequential "activation-first, weights-later" strategy. Stage 1 keeps weights in FP32 while quantizing activations with a feature-mimicking loss in only one epoch, keeping the model in its original "basin" and thereby preserving generalization. Stage 2 quantizes weights using a PTQ method. As a result, GPLQ is 100x faster than existing QAT methods, lowers the memory footprint to levels below even that of FP32 training, and achieves 4-bit model performance that is highly competitive with FP32 models, both in accuracy on ImageNet and in generalization to diverse downstream tasks, including fine-grained visual classification and object detection. We will release an easy-to-use open-source toolkit supporting multiple vision tasks.
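To make the two-stage idea concrete, here is a minimal, hedged sketch of the "activation-first, weights-later" workflow on a toy linear layer. All function names, the uniform symmetric fake-quantizer, and the min-max PTQ rule below are illustrative assumptions, not the paper's actual implementation; they only show how Stage 1 (FP32 weights, fake-quantized activations, feature-imitation loss against FP32 features) differs from Stage 2 (PTQ-style weight quantization applied afterwards).

```python
# Illustrative sketch of GPLQ's two stages on a toy linear layer.
# The quantizer and loss below are simplified placeholders, not the
# paper's implementation.

def fake_quant(x, bits=4):
    """Uniform symmetric fake quantization: quantize, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                     # e.g. 7 for signed 4-bit
    scale = (max(abs(v) for v in x) / qmax) or 1.0  # min-max calibration
    return [round(v / scale) * scale for v in x]

def linear(w, x):
    """Plain FP32 matrix-vector product (stand-in for a ViT layer)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def feature_imitation_loss(student, teacher):
    """MSE between quantized-activation features and FP32 features."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

# Toy FP32 weights and input.
w = [[0.5, -1.2], [0.3, 0.8]]
x = [1.0, -2.0]

# Stage 1: weights stay FP32; only activations are fake-quantized.
# The imitation loss would drive the activation quantizer to keep the
# model's features (and hence its optimization "basin") close to FP32.
fp32_feats = linear(w, x)
stage1_feats = linear(w, fake_quant(x))
loss = feature_imitation_loss(stage1_feats, fp32_feats)

# Stage 2: weights are quantized afterwards with a PTQ-style rule,
# on top of the already-calibrated activation quantizer.
w_q = [fake_quant(row) for row in w]
stage2_feats = linear(w_q, fake_quant(x))
```

In a real training loop the Stage 1 loss would be minimized over the activation quantizer's parameters for a single epoch; this sketch just evaluates it once to show where each quantizer sits.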