RAViT: Resolution-Adaptive Vision Transformer

πŸ“… 2026-02-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the high computational cost of Vision Transformers, which hinders their deployment in resource-constrained settings. The authors propose a multi-branch Vision Transformer that processes multi-scale copies of the input image and combines a resolution-adaptive strategy with an early-exit mechanism to balance accuracy and computation dynamically at inference time. The method matches the classification accuracy of standard Vision Transformers on CIFAR-10, Tiny ImageNet, and ImageNet while requiring only about 70% of the original FLOPs.
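
The early-exit mechanism can be read as a simple run-time loop over branches, from coarse to fine resolution. Below is a minimal sketch under stated assumptions: `adaptive_predict`, the branch modules, and the max-softmax confidence rule are illustrative choices, not the paper's released code, and the sketch assumes a batch of one image.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def adaptive_predict(branches, resolutions, image, threshold=0.9):
    """Run branches from coarsest to finest resolution and stop as soon as
    the softmax confidence clears the threshold. Raising the threshold
    shifts the run-time trade-off toward accuracy; lowering it saves FLOPs."""
    logits = None
    for branch, res in zip(branches, resolutions):
        x = F.interpolate(image, size=(res, res), mode="bilinear",
                          align_corners=False)
        logits = branch(x)  # each branch: a (hypothetical) ViT classifier
        confidence = F.softmax(logits, dim=-1).max(dim=-1).values
        if confidence.item() >= threshold:  # early exit: skip finer branches
            break
    return logits
```

The threshold acts as the run-time knob the summary describes: the same trained model can be operated at different accuracy/FLOPs points without retraining.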

πŸ“ Abstract
Vision Transformers have recently made a breakthrough in computer vision, showing excellent accuracy across numerous applications. However, their computational cost is very high compared to alternative approaches such as Convolutional Neural Networks. To address this problem, we propose RAViT, a novel framework for image classification based on a multi-branch network that operates on several copies of the same image at different resolutions, reducing the computational cost while preserving overall accuracy. Furthermore, our framework includes an early-exit mechanism that makes the model adaptive and allows the appropriate trade-off between accuracy and computational cost to be chosen at run time. For example, in a two-branch architecture, the original image is first downscaled, a prediction is made on it by a first transformer, and that prediction is then reused together with the original-resolution image to produce a final prediction on a second transformer at lower cost than a classical Vision Transformer. The early-exit process allows the model to return a final prediction at intermediate branches, saving even more computation. We evaluated our approach on CIFAR-10, Tiny ImageNet, and ImageNet, obtaining accuracy equivalent to the classical Vision Transformer model with only around 70% of the FLOPs.
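
To make the two-branch example concrete, here is a minimal PyTorch sketch. The class name `TwoBranchRAViT`, the injected `low_res_vit`/`high_res_vit` modules, and the fusion of the coarse prediction via a linear embedding are all assumptions made for illustration; the abstract does not specify how the first branch's prediction is fed to the second transformer.

```python
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchRAViT(nn.Module):
    """Sketch of the two-branch scheme described in the abstract: a first
    transformer classifies a downscaled copy of the image; its prediction
    is embedded and handed to a second transformer that sees the
    full-resolution image. The fusion mechanism here is hypothetical."""

    def __init__(self, low_res_vit, high_res_vit, num_classes, embed_dim):
        super().__init__()
        self.low_res_vit = low_res_vit    # compact ViT returning class logits
        self.high_res_vit = high_res_vit  # ViT taking (image, conditioning vector)
        self.embed_logits = nn.Linear(num_classes, embed_dim)

    def forward(self, image, low_res=112, exit_threshold=None):
        # Branch 1: cheap prediction on a reduced-resolution copy.
        x_small = F.interpolate(image, size=(low_res, low_res),
                                mode="bilinear", align_corners=False)
        logits_small = self.low_res_vit(x_small)

        # Optional early exit: return the coarse prediction if confident enough.
        if exit_threshold is not None:
            confidence = F.softmax(logits_small, dim=-1).max(dim=-1).values
            if bool((confidence >= exit_threshold).all()):
                return logits_small

        # Branch 2: reuse the coarse prediction alongside the full-size image.
        conditioning = self.embed_logits(logits_small)
        return self.high_res_vit(image, conditioning)
```

Reusing the first branch's prediction is what lets the second transformer do less work than a standalone full-resolution ViT, which is where the reported FLOPs savings come from.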
Problem

Research questions and friction points this paper is trying to address.

Vision Transformer
computational cost
image classification
model efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Resolution-Adaptive
Vision Transformer
Multi-branch Network
Early Exit
Computational Efficiency
πŸ”Ž Similar Papers
No similar papers found.