RAPID: Layer-Wise Redundancy-Aware Pruning and Importance-Driven Token Merging for Efficient ViT

📅 2026-06-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost of Vision Transformers (ViTs), which stems from the quadratic complexity of self-attention, and observes that existing token compression methods overlook how feature representations evolve across network depth. The authors propose RAPID, a depth-aware, plug-and-play token compression framework that switches strategy by layer without retraining: in shallow-to-middle layers, tokens are pruned using a redundancy-similarity aware metric that removes over-represented local patterns, while in deeper layers, CLS token attention guides merging so that semantically important tokens are protected and less important but similar tokens are fused. The approach thereby aligns compression strategies with the hierarchical feature evolution inherent in ViTs. Evaluated on ImageNet-1K with ViT and DeiT architectures, it establishes a superior accuracy-compression Pareto frontier over plug-and-play baselines such as ToMe and ToFu, with up to 4.29% higher accuracy than ToMe at extreme reduction rates.
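To make the cost argument concrete, the sketch below counts approximate multiply-accumulates (MACs) for one standard multi-head self-attention block as a function of token count. It is a back-of-the-envelope illustration, not a measurement from the paper: the function name `attention_macs` is hypothetical, softmax and bias costs are ignored, and the example sizes (197 tokens, width 768) are the usual ViT-B/16 values at 224x224 resolution, assumed here for illustration.

```python
def attention_macs(num_tokens: int, dim: int) -> int:
    """Approximate multiply-accumulates for one self-attention block.

    Assumes standard multi-head attention with model width `dim`;
    softmax, biases, and the MLP that follows are not counted.
    """
    # Q, K, V and output projections: linear in the number of tokens.
    projections = 4 * num_tokens * dim * dim
    # QK^T scores and the attention-weighted sum over V: quadratic in tokens.
    pairwise = 2 * num_tokens * num_tokens * dim
    return projections + pairwise


if __name__ == "__main__":
    full = attention_macs(197, 768)  # assumed ViT-B/16 token count and width
    half = attention_macs(99, 768)   # roughly half the tokens kept
    print(f"full: {full:,} MACs, reduced: {half:,} MACs, ratio: {half / full:.2f}")
```

Because every term depends on the token count, removing or merging tokens lowers the cost of the layer where it happens and of every later layer, which is why token reduction is attractive as a training-free speedup.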

📝 Abstract

Vision Transformers (ViTs) achieve strong performance but suffer from high computational costs due to quadratic self-attention complexity. Although token reduction techniques such as pruning and merging mitigate this, they typically overlook how representations evolve across network depth. We propose RAPID, a depth-aware token reduction framework that adapts reduction strategies to the layer-wise characteristics of token representations. The primary methodological contribution is a bifurcated strategy: in shallow-to-middle layers, RAPID employs a redundancy-similarity aware pruning metric to eliminate over-represented local patterns. As features transition to global semantic concepts in deeper layers, the framework shifts to an importance-similarity aware merging mechanism. This stage leverages classification (CLS) token attention weights to protect semantically critical tokens while fusing less important but similar neighbors. Empirical validation on ImageNet-1K using ViT and DeiT architectures demonstrates that RAPID establishes a superior accuracy-compression Pareto frontier compared to plug-and-play baselines such as ToMe and ToFu. RAPID is particularly robust in aggressive compression regimes, achieving up to 4.29% higher accuracy than ToMe at extreme reduction rates. Our framework provides a training-free template for optimizing vision models by aligning reduction strategies with hierarchical feature evolution.
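The bifurcated strategy described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the abstract does not give the scoring formulas, so the redundancy score (cosine similarity to the nearest surviving patch token), the merge rule (averaging each low-importance token into its most similar protected token), and the names `prune_redundant`, `merge_by_importance`, `reduce_tokens`, `switch_layer`, and `r` are all hypothetical stand-ins.

```python
import numpy as np


def cosine_sim(x: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of x."""
    unit = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    return unit @ unit.T


def prune_redundant(tokens: np.ndarray, r: int) -> np.ndarray:
    """Shallow-layer step (assumed metric): drop the r most redundant patches.

    tokens has shape (N, D); row 0 is the CLS token and is never dropped.
    """
    cls, patches = tokens[:1], tokens[1:]
    sim = cosine_sim(patches)
    np.fill_diagonal(sim, -np.inf)  # a token is not redundant with itself
    keep = np.ones(len(patches), dtype=bool)
    for _ in range(min(r, len(patches) - 1)):
        # Only compare tokens that are still kept.
        masked = np.where(keep[:, None] & keep[None, :], sim, -np.inf)
        redundancy = masked.max(axis=1)  # similarity to nearest survivor
        keep[int(np.argmax(redundancy))] = False
    return np.concatenate([cls, patches[keep]], axis=0)


def merge_by_importance(tokens: np.ndarray, cls_attn: np.ndarray, r: int) -> np.ndarray:
    """Deep-layer step (assumed rule): merge the r least important patches.

    cls_attn has shape (N - 1,): attention from CLS to each patch token.
    """
    cls, patches = tokens[:1], tokens[1:]
    r = min(r, len(patches) - 1)
    if r <= 0:
        return tokens
    order = np.argsort(cls_attn)              # ascending importance
    src, dst = order[:r], np.sort(order[r:])  # dst tokens are protected
    target = cosine_sim(patches)[np.ix_(src, dst)].argmax(axis=1)
    merged = patches[dst].copy()
    counts = np.ones(len(dst))
    np.add.at(merged, target, patches[src])   # accumulate sources into targets
    np.add.at(counts, target, 1.0)
    merged /= counts[:, None]                 # average each merged group
    return np.concatenate([cls, merged], axis=0)


def reduce_tokens(tokens, cls_attn, layer: int, switch_layer: int, r: int):
    """Pick the strategy by depth: prune before switch_layer, merge after."""
    if layer < switch_layer:
        return prune_redundant(tokens, r)
    return merge_by_importance(tokens, cls_attn, r)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(17, 8))  # 1 CLS token + 16 patch tokens
    attn = rng.random(16)         # stand-in for CLS attention weights
    print(reduce_tokens(x, attn, layer=2, switch_layer=6, r=4).shape)
    print(reduce_tokens(x, attn, layer=9, switch_layer=6, r=4).shape)
```

Both branches remove `r` patch tokens per call and leave the CLS token in place; only the criterion changes with depth. In a real model the CLS attention weights would come from the attention map of the current layer, and the switch point and per-layer reduction rate would need to be chosen per architecture.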

Problem

Research questions and friction points this paper is trying to address.

Vision Transformers

token reduction

computational efficiency

layer-wise representation

self-attention complexity

Innovation

Methods, ideas, or system contributions that make the work stand out.

depth-aware token reduction

redundancy-aware pruning

importance-driven merging