ToFe: Lagged Token Freezing and Reusing for Efficient Vision Transformer Inference

📅 2025-07-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying Vision Transformers (ViTs) on resource-constrained devices is hindered by the prohibitively high computational cost of self-attention. Existing token pruning methods irreversibly discard unimportant tokens, causing permanent information loss and precluding cross-layer reuse. To address this, we propose Token Freezing and Reusing (ToFe), the first framework enabling **delayed freezing and dynamic reuse** of non-critical tokens. ToFe employs an importance prediction module to identify low-contribution tokens and temporarily freezes—rather than discards—them. An approximate recovery module enables cross-stage information reutilization, while computation-budget-aware end-to-end joint training ensures adaptive computational allocation. Evaluated on LV-ViT, ToFe achieves 50% FLOPs reduction with only a 1.8% drop in Top-1 accuracy—substantially outperforming state-of-the-art methods in the accuracy-efficiency trade-off.

📝 Abstract
Although vision transformers (ViTs) have shown remarkable success in various vision tasks, their computationally expensive self-attention hinders their deployment on resource-constrained devices. Token reduction, which discards less important tokens during forward propagation, has been proposed to enhance the efficiency of transformer models. However, existing methods handle unimportant tokens irreversibly, preventing their reuse in subsequent blocks. Considering that transformers attend to different information across blocks, tokens reduced in early blocks might be useful later. Furthermore, to adapt transformer models for resource-constrained devices, it is crucial to strike a balance between model performance and computational overhead. To address these challenges, in this paper we introduce a novel Token Freezing and Reusing (ToFe) framework, in which we identify important tokens at each stage and temporarily freeze the unimportant ones, allowing their lagged reuse at a later stage. Specifically, we design a prediction module for token identification and an approximate module for recovery of the frozen tokens. By jointly optimizing with the backbone through computation-budget-aware end-to-end training, ToFe can adaptively process the necessary tokens at each block, reducing computational cost while maintaining performance. Extensive experiments demonstrate that ToFe reduces the computational cost of the LV-ViT model by 50% with less than a 2% drop in Top-1 accuracy, achieving a better trade-off between performance and complexity than state-of-the-art methods.
Problem

Research questions and friction points this paper is trying to address.

Reduces computational cost of vision transformers
Enables reuse of unimportant tokens later
Balances model performance and efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lagged token freezing and reusing framework
Prediction module for token identification
Approximate module for frozen token recovery
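The freeze-then-reuse idea can be sketched in a few lines. The sketch below is a simplified stand-in, not the paper's implementation: it ranks tokens by a given importance score (the paper uses a learned prediction module), keeps the top fraction active, parks the rest in a frozen set, and later reinserts the frozen tokens at their original positions (the paper instead recovers them through a learned approximate module). Function names, the `keep_ratio` parameter, and the identity-style recovery are all illustrative assumptions.

```python
import numpy as np

def freeze_tokens(tokens, scores, keep_ratio=0.5):
    """Split tokens into an active set and a frozen set by importance.

    tokens: (N, D) token embeddings; scores: (N,) importance per token
    (in ToFe these scores come from a learned prediction module).
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    order = np.argsort(scores)[::-1]       # most important first
    active_idx = np.sort(order[:n_keep])   # preserve positional order
    frozen_idx = np.sort(order[n_keep:])
    return tokens[active_idx], tokens[frozen_idx], active_idx, frozen_idx

def reuse_frozen(active, frozen, active_idx, frozen_idx):
    """Reinsert frozen tokens at their original positions so a later
    stage can process them again (identity recovery here; the paper
    uses a learned approximate-recovery module instead)."""
    n = len(active) + len(frozen)
    out = np.empty((n, active.shape[1]), dtype=active.dtype)
    out[active_idx] = active
    out[frozen_idx] = frozen
    return out
```

Only the active tokens would flow through the intervening transformer blocks, which is where the FLOPs savings come from; the frozen set is held at zero cost until it is reused.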