🤖 AI Summary
This work addresses the incompatibility between Vision Transformers (ViT) and CNN-optimized Brain Processing Units (BPUs), which hinders leveraging their INT8 acceleration capabilities. The authors propose a hardware-adaptation method that requires no retraining: by replacing linear layers and LayerNorm in ViT with carefully designed convolutional operators, the model directly inherits original weights and supports INT8 quantization. This approach enables the first efficient deployment of ViT on BPU hardware, bridging the architectural gap between Transformers and CNN-specialized accelerators. Experiments show that DeiT-Base achieves 80.4% top-1 accuracy on ImageNet—only a 1.4% drop from the baseline—while accelerating inference by 3.8×; on a flower classification task, accuracy degradation is merely 0.5%.
📝 Abstract
With the advancement of deep learning technologies, specialized neural processing hardware such as the Brain Processing Unit (BPU) has emerged as a dedicated platform for CNN acceleration, offering optimized INT8 computation for convolutional operations. Meanwhile, Vision Transformer (ViT) models, such as the Data-efficient Image Transformer (DeiT), have demonstrated superior performance and play increasingly crucial roles in computer vision tasks. However, there is an architectural mismatch between CNN-optimized hardware and Vision Transformer computation: linear layers in Transformers operate on three-dimensional data, while BPU acceleration is designed for four-dimensional convolution operations. This makes it difficult, or even impossible, to leverage the BPU's advantages when deploying Vision Transformers. To address this challenge, we propose a novel approach that restructures the Vision Transformer by replacing its linear layers and layer normalization operations with carefully designed convolutional operators. This enables DeiT to fully utilize the acceleration capabilities of BPUs while allowing the restructured model to inherit the original weight parameters without retraining or fine-tuning. To the best of our knowledge, this is the first successful deployment of Vision Transformers that fully leverages BPU acceleration. Experiments on image classification datasets demonstrate the effectiveness of our approach. Specifically, the quantized DeiT-Base model achieves 80.4% accuracy on ImageNet, compared to the original 81.8%, while obtaining up to a 3.8× inference speedup. Our fine-tuned DeiT model on the flower classification dataset also performs well, with only a 0.5% accuracy drop for DeiT-Base, further demonstrating the effectiveness of our method.
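The core idea of replacing linear layers with convolutions relies on a standard equivalence: a linear layer applied independently to each token is mathematically identical to a 1×1 convolution applied to the same tokens laid out as a 4-D (NCHW) feature map, using the same weights. The sketch below illustrates this equivalence in plain NumPy; the shapes and variable names are illustrative assumptions, not the paper's actual implementation, which additionally handles LayerNorm and INT8 quantization.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, C_in, C_out = 2, 16, 8, 12   # batch, tokens, input/output channels (illustrative)
H = W = 4                          # 16 tokens rearranged onto a 4x4 spatial grid

x = rng.standard_normal((B, N, C_in))        # 3-D token sequence, as in a Transformer
weight = rng.standard_normal((C_out, C_in))  # one shared weight matrix for both paths
bias = rng.standard_normal(C_out)

# Path 1: ordinary linear layer on 3-D data, y[b,n,o] = sum_i x[b,n,i] * W[o,i] + b[o]
y_linear = x @ weight.T + bias

# Path 2: the same weights used as a 1x1 convolution on 4-D (NCHW) data.
# A 1x1 conv is just a per-pixel matrix multiply over the channel dimension.
x_map = x.transpose(0, 2, 1).reshape(B, C_in, H, W)
y_conv = np.einsum("oi,bihw->bohw", weight, x_map) + bias[None, :, None, None]

# Map the conv output back to token layout and verify both paths agree.
y_back = y_conv.reshape(B, C_out, N).transpose(0, 2, 1)
assert np.allclose(y_linear, y_back)
```

Because the two computations are exactly equivalent, the original ViT weights can be copied into the convolutional form directly, which is what makes a retraining-free restructuring possible.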