ViT-AdaLA: Adapting Vision Transformers with Linear Attention

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of scaling Vision Transformers (ViTs) to long-sequence tasks, where the quadratic complexity of self-attention becomes prohibitive and existing linear attention methods struggle to effectively transfer knowledge from pretrained ViTs. To overcome this limitation, we propose ViT-AdaLA, a three-stage knowledge transfer framework that seamlessly migrates knowledge from a pretrained softmax-based ViT to a linear attention architecture through attention alignment, feature distillation, and supervised fine-tuning. ViT-AdaLA is the first method to enable end-to-end knowledge transfer from standard ViTs to their linearized counterparts, effectively mitigating error accumulation across layers. Experimental results demonstrate that ViT-AdaLA significantly outperforms existing linear attention approaches on both image classification and segmentation tasks, achieving high performance while maintaining strong generalization capabilities.
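The summary describes a per-block attention alignment stage that matches vanilla linear attention to the pretrained softmax attention. The paper's exact objective is not reproduced here; the sketch below is an illustrative assumption that uses a mean-squared error between the two row-normalized attention maps, with a ReLU-plus-epsilon feature map `phi` standing in for whatever kernel the method actually uses.

```python
import numpy as np

def softmax_map(Q, K):
    """Row-normalized softmax attention map of a frozen teacher block."""
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))  # stabilized softmax
    return w / w.sum(axis=-1, keepdims=True)

def linear_map(Q, K, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Row-normalized map of vanilla linear attention.
    phi is a positive feature map (ReLU + eps here, an illustrative choice)."""
    Qp, Kp = phi(Q), phi(K)
    w = Qp @ Kp.T                      # all entries positive by construction
    return w / w.sum(axis=-1, keepdims=True)

def alignment_loss(Q, K):
    """Hypothetical per-block alignment objective: MSE between the
    linearized attention map and the teacher's softmax map."""
    return np.mean((linear_map(Q, K) - softmax_map(Q, K)) ** 2)

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(2, 6, 4))      # 6 tokens, head dim 4
loss = alignment_loss(Q, K)
```

In practice such a loss would be minimized per block while the teacher stays frozen; the summary's point is that even after this stage, residual per-block errors compound across depth, motivating the later feature-distillation stage.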

📝 Abstract
Vision Transformer (ViT)-based vision foundation models (VFMs) have achieved remarkable performance across diverse vision tasks, but suffer from the quadratic complexity of self-attention, which limits scalability to long sequences. Existing linear attention approaches for ViTs are typically trained from scratch, requiring substantial computational resources, while linearization-based methods developed for large language model decoders do not transfer well to ViTs. To address these challenges, we propose ViT-AdaLA, a novel framework for effectively adapting and transferring prior knowledge from VFMs to linear attention ViTs. ViT-AdaLA consists of three stages: attention alignment, feature alignment, and supervised fine-tuning. In the attention alignment stage, we align vanilla linear attention with the original softmax-based attention in each block to approximate the behavior of softmax attention. However, residual approximation errors inevitably accumulate across layers. We mitigate this by fine-tuning the linearized ViT to align its final-layer features with those of a frozen softmax VFM teacher. Finally, the adapted prior knowledge is transferred to downstream tasks through supervised fine-tuning. Extensive experiments on classification and segmentation tasks demonstrate the effectiveness and generality of ViT-AdaLA over various state-of-the-art linear attention counterparts.
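For context on the complexity gap the abstract describes, a minimal NumPy sketch contrasts standard softmax attention, which materializes an N×N map (O(N²·d) time), with kernelized linear attention, which summarizes keys and values into a d×d matrix and never forms the N×N map (O(N·d²) time). The feature map `phi` below is an illustrative choice, not the specific kernel used by ViT-AdaLA.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: O(N^2 * d), builds the full N x N weight matrix."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Kernelized linear attention: O(N * d^2).
    out_i = phi(q_i) @ (sum_j phi(k_j) v_j^T) / (phi(q_i) @ sum_j phi(k_j))."""
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                      # d x d summary, computed once
    z = Qp @ Kp.sum(axis=0)            # per-query normalizer, shape (N,)
    return (Qp @ kv) / z[:, None]

rng = np.random.default_rng(0)
N, d = 8, 4
Q, K, V = rng.normal(size=(3, N, d))
out = linear_attention(Q, K, V)        # shape (N, d), no N x N map formed
```

Because `kv` and the key sum are fixed-size in N, sequence length enters only linearly, which is the scalability property the abstract targets for long-sequence vision tasks.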
Problem

Research questions and friction points this paper is trying to address.

Vision Transformers
linear attention
quadratic complexity
knowledge transfer
scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Transformers
Linear Attention
Knowledge Transfer
Attention Alignment
Feature Alignment