Block-based Symmetric Pruning and Fusion for Efficient Vision Transformers

📅 2025-07-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision Transformers (ViTs) suffer from high computational complexity in self-attention due to quadratic scaling with token sequence length. Existing token-level pruning methods independently prune query and key tokens, neglecting inter-token interactions and thus degrading accuracy. To address this, we propose a block-wise symmetric pruning and fusion framework: leveraging weight sharing between query and key projections, we prune only the upper-triangular portion of the attention matrix, enabling joint optimization; further, we introduce neighborhood-aware importance scoring and similarity-driven token fusion to explicitly model local structural and semantic correlations. The method is trained end-to-end on standard ViTs without auxiliary modules. On DeiT-T and DeiT-S, it achieves +1.3% and +2.0% top-1 accuracy on ImageNet, respectively, while reducing FLOPs by 50% and accelerating inference by 40%, outperforming state-of-the-art pruning approaches. Our core contribution is the first unified paradigm for visual token compression integrating symmetric pruning, neighborhood-aware evaluation, and similarity-based fusion.
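The key structural trick in the summary above is that sharing the query and key projection makes the attention-score matrix symmetric, so only the upper-triangular half needs to be computed. A minimal sketch of that idea (the function name, shapes, and scaling are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def symmetric_attention_scores(x, w_qk):
    """Attention scores with a shared Q/K projection (hypothetical sketch).

    Because Q = K = x @ w_qk, the score matrix S S^T is symmetric, so we
    compute only the upper triangle and mirror it.
    """
    s = x @ w_qk                      # (n, d): shared projection for Q and K
    n, d = s.shape
    scores = np.zeros((n, n))
    iu = np.triu_indices(n)
    # dot products for upper-triangular (i <= j) pairs only
    scores[iu] = (s[iu[0]] * s[iu[1]]).sum(-1) / np.sqrt(d)
    # mirror; the diagonal was counted once, so subtract its duplicate
    return scores + scores.T - np.diag(np.diag(scores))
```

This halves the number of score dot products relative to forming the full n-by-n matrix, which is what enables pruning decisions to be made jointly for Q and K rather than independently.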

📝 Abstract
Vision Transformer (ViT) has achieved impressive results across various vision tasks, yet its high computational cost limits practical applications. Recent methods have aimed to reduce ViT's $O(n^2)$ complexity by pruning unimportant tokens. However, these techniques often sacrifice accuracy by independently pruning query (Q) and key (K) tokens, leading to performance degradation due to overlooked token interactions. To address this limitation, we introduce a novel \textbf{Block-based Symmetric Pruning and Fusion} for efficient ViT (BSPF-ViT) that optimizes the pruning of Q/K tokens jointly. Unlike previous methods that consider only a single direction, our approach evaluates each token and its neighbors to decide which tokens to retain by taking token interaction into account. The retained tokens are compressed through a similarity fusion step, preserving key information while reducing computational costs. The shared weights of Q/K tokens create a symmetric attention matrix, allowing only the upper-triangular part to be processed for a speedup. BSPF-ViT consistently outperforms state-of-the-art ViT methods at all pruning levels, increasing ImageNet classification accuracy by 1.3% on DeiT-T and 2.0% on DeiT-S, while reducing computational overhead by 50%. It achieves a 40% speedup with improved accuracy across various ViTs.
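The abstract's "evaluates each token and its neighbors" step can be illustrated with a toy scoring rule: a token's importance is its mean received attention, blended with the mean score of its spatial neighbors on the patch grid. The blending weight, padding, and 4-neighborhood below are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def neighborhood_importance(attn, grid_h, grid_w, alpha=0.5):
    """Hypothetical neighborhood-aware importance score.

    attn: (n, n) attention matrix over n = grid_h * grid_w patch tokens.
    A token's score mixes the attention it receives (column mean) with
    the average score of its 4-connected grid neighbors.
    """
    base = attn.mean(axis=0).reshape(grid_h, grid_w)
    padded = np.pad(base, 1, mode="edge")       # replicate borders
    neigh = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
             padded[1:-1, :-2] + padded[1:-1, 2:]) / 4
    return (alpha * base + (1 - alpha) * neigh).ravel()
```

A token surrounded by high-scoring neighbors is thus less likely to be pruned than one that scores well in isolation, which is how local structural correlation enters the retention decision.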
Problem

Research questions and friction points this paper is trying to address.

Reduces Vision Transformer computational cost
Improves accuracy in token pruning
Optimizes symmetric Q/K token pruning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jointly prunes Q/K tokens for efficiency
Uses similarity fusion to compress tokens
Symmetric attention matrix speeds up pruning
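The "similarity fusion" bullet above can be sketched as merging each pruned token into its most cosine-similar retained token by averaging. The specific merge rule (argmax over cosine similarity, running mean) is an assumed simplification, not the paper's exact procedure:

```python
import numpy as np

def fuse_tokens(tokens, keep_idx, drop_idx):
    """Hypothetical similarity-driven fusion: each dropped token is
    folded into its most cosine-similar kept token via a running mean."""
    kept = tokens[keep_idx].astype(float).copy()
    counts = np.ones(len(keep_idx))
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    for j in drop_idx:
        sims = normed[keep_idx] @ normed[j]   # cosine similarity to kept set
        t = int(np.argmax(sims))              # most similar retained token
        kept[t] += tokens[j]
        counts[t] += 1
    return kept / counts[:, None]             # average of merged tokens
```

Fusing rather than discarding is what lets the method cut tokens aggressively (50% fewer FLOPs per the summary) while keeping the information of the pruned tokens in compressed form.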