GPLQ: A General, Practical, and Lightning QAT Method for Vision Transformers

📅 2025-06-13
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Low-bit quantization of Vision Transformers (ViTs) faces several challenges: severe accuracy degradation under post-training quantization (PTQ), high computational overhead and poor generalizability in quantization-aware training (QAT), and training instability. This paper proposes GPLQ, a lightweight 4-bit quantization framework featuring a novel "activation-first, weights-later" two-stage strategy: in Stage 1, FP32 weights are frozen while activation quantization is optimized in a single pass, preserving the geometry of the optimization basin; in Stage 2, weights are quantized via efficient PTQ-style calibration. A feature mimicking loss is introduced to enhance transfer robustness across tasks, and an open-source, multi-task-compatible toolchain is released. Compared to state-of-the-art QAT methods, GPLQ accelerates training by 100x and reduces GPU memory consumption below that of FP32 training. On ImageNet, fine-grained classification, and object detection, 4-bit ViTs quantized by GPLQ perform nearly on par with their FP32 counterparts.
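At the heart of any such pipeline is fake quantization of activations: values are rounded to a low-bit integer grid and immediately dequantized back to floating point, so the network trains against quantization error. A minimal sketch of a symmetric 4-bit uniform quantizer follows; this is an illustrative stand-in, not GPLQ's exact quantizer design.

```python
def fake_quantize(x, n_bits=4, scale=None):
    """Symmetric uniform fake quantization: round values to an
    integer grid, clip to the representable range, then
    dequantize back to floating point."""
    qmax = 2 ** (n_bits - 1) - 1          # 7 for 4-bit
    if scale is None:
        # naive max-abs calibration (illustrative choice)
        scale = max(abs(v) for v in x) / qmax
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in x]
    return [v * scale for v in q]
```

For example, `fake_quantize([1.0, -1.0, 0.25])` maps 0.25 to the nearest 4-bit grid point 2/7 ≈ 0.286, while the extremes round trip almost exactly.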

๐Ÿ“ Abstract
Vision Transformers (ViTs) are essential in computer vision but are computationally intensive. Model quantization, particularly to low bit-widths like 4-bit, aims to alleviate this difficulty, yet existing Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) methods exhibit significant limitations. PTQ often incurs a substantial accuracy drop, while QAT achieves high accuracy but suffers from prohibitive computational costs, limited generalization to downstream tasks, training instability, and the lack of an open-source codebase. To address these challenges, this paper introduces General, Practical, and Lightning Quantization (GPLQ), a novel framework designed for efficient and effective ViT quantization. GPLQ is founded on two key empirical insights: the paramount importance of activation quantization and the necessity of preserving the model's original optimization "basin" to maintain generalization. Consequently, GPLQ employs a sequential "activation-first, weights-later" strategy. Stage 1 keeps weights in FP32 while quantizing activations with a feature mimicking loss in only one epoch, keeping the model in its original "basin" and thereby preserving generalization. Stage 2 then quantizes weights using a PTQ method. As a result, GPLQ is 100x faster than existing QAT methods, lowers the memory footprint to levels even below FP32 training, and achieves 4-bit model performance that is highly competitive with FP32 models, both in accuracy on ImageNet and in generalization to diverse downstream tasks, including fine-grained visual classification and object detection. We will release an easy-to-use open-source toolkit supporting multiple vision tasks.
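Stage 1's feature mimicking loss can be read as a simple regression of the activation-quantized model's intermediate features onto those of the frozen FP32 model. A hedged sketch is below; which layers are matched and how terms are weighted are assumptions here, not the paper's specification.

```python
def feature_mimicking_loss(feats_q, feats_fp):
    """Mean-squared error between features from the
    activation-quantized student and the frozen FP32 teacher.
    Driving this loss toward zero keeps the quantized model
    close to the teacher's representation (its "basin")."""
    assert len(feats_q) == len(feats_fp)
    return sum((a - b) ** 2 for a, b in zip(feats_q, feats_fp)) / len(feats_fp)
```

Identical features incur zero loss, so minimizing it over one epoch nudges the quantized activations back toward the FP32 model's behavior rather than retraining the network from scratch.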
Problem

Research questions and friction points this paper is trying to address.

Addresses high computational cost of Vision Transformers (ViTs)
Improves accuracy and efficiency of 4-bit quantization methods
Reduces training instability and lack of open-source QAT solutions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequential activation-first quantization strategy
Preserves model's original optimization basin
100x faster than existing QAT methods
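Because Stage 2 quantizes weights with a PTQ-style method, the main data-dependent step is estimating quantization scales from a small calibration set. One common recipe is percentile clipping, shown here as an illustrative assumption rather than GPLQ's documented choice.

```python
def calibrate_scale(samples, n_bits=4, percentile=0.999):
    """Estimate a quantization scale by clipping |x| at a high
    percentile of the calibration data, which is more robust to
    outliers than using the absolute maximum."""
    qmax = 2 ** (n_bits - 1) - 1
    mags = sorted(abs(v) for v in samples)
    idx = min(len(mags) - 1, int(percentile * (len(mags) - 1)))
    return mags[idx] / qmax
```

Clipping at the 99.9th percentile trades a little error on rare outliers for a finer grid over the bulk of the distribution, which matters at 4-bit where only 16 levels are available.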
Guang Liang
Nanjing University
Deep learning architectures
Xinyao Liu
University of Science and Technology of China
Computer Vision · Large Language Model
Jianxin Wu
State Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China