Where Do Tokens Go? Understanding Pruning Behaviors in STEP at High Resolutions

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the excessive computational and memory overhead of Vision Transformers (ViTs) in high-resolution semantic segmentation, this paper proposes STEP: a framework that jointly optimizes efficiency and accuracy. STEP introduces dCTS, a lightweight CNN-based policy network that merges patches into variable-sized superpatches, and integrates an early-exit mechanism into the ViT-Large encoder to dynamically prune high-confidence supertokens. Together, these components provide fine-grained control over the computational load. On 1024×1024 images, STEP reduces FLOPs by up to 4×, accelerates inference by 1.7×, incurs at most a 2.0% mIoU degradation, and halts up to 40% of tokens before the final encoder layer. Its core contribution is the first joint integration of dynamic supertoken construction and early exiting into ViT-based semantic segmentation, enabling adaptive computation while preserving accuracy.

📝 Abstract
Vision Transformers (ViTs) achieve state-of-the-art performance in semantic segmentation but are hindered by high computational and memory costs. To address this, we propose STEP (SuperToken and Early-Pruning), a hybrid token-reduction framework that combines dynamic patch merging and token pruning to enhance efficiency without significantly compromising accuracy. At the core of STEP is dCTS, a lightweight CNN-based policy network that enables flexible merging of patches into superpatches. Encoder blocks also integrate early exits that remove high-confidence supertokens, lowering the computational load. We evaluate our method on high-resolution semantic segmentation benchmarks, including images up to 1024 x 1024, and show that when dCTS is applied alone, the token count can be reduced by a factor of 2.5 compared to the standard 16 x 16 pixel patching scheme. This yields a 2.6x reduction in computational cost and a 3.4x increase in throughput when using ViT-Large as the backbone. Applying the full STEP framework further improves efficiency, reaching up to a 4x reduction in computational complexity and a 1.7x gain in inference speed, with a maximum accuracy drop of no more than 2.0%. With the proposed STEP configurations, up to 40% of tokens can be confidently predicted and halted before reaching the final encoder layer.
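The early-exit idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the confidence threshold, the auxiliary-head interface, and all function and variable names here are assumptions for illustration only.

```python
import numpy as np

def early_exit_prune(tokens, aux_logits, threshold=0.9):
    """Halt tokens whose auxiliary prediction is already confident.

    tokens:     (N, D) array of supertoken embeddings.
    aux_logits: (N, C) class logits from a lightweight auxiliary head
                attached to an intermediate encoder block (assumed interface).
    Returns the surviving tokens, their indices, and the frozen class
    predictions for the halted tokens.
    """
    # Softmax over classes -> per-token confidence.
    e = np.exp(aux_logits - aux_logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    conf = probs.max(axis=1)

    keep = conf < threshold                     # still uncertain: keep processing
    halted_pred = probs.argmax(axis=1)[~keep]   # freeze confident predictions
    return tokens[keep], np.flatnonzero(keep), halted_pred
```

Later encoder blocks then only attend over the surviving tokens, which is where the computational savings come from; halted tokens keep their frozen predictions through to the segmentation output.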
Problem

Research questions and friction points this paper is trying to address.

Reducing Vision Transformers' computational and memory costs
Enhancing efficiency without significantly compromising accuracy
Pruning tokens in high-resolution semantic segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid token-reduction framework STEP
Lightweight CNN policy network dCTS
Early-exit mechanism for supertokens
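The superpatch-merging contribution can be sketched with a simple hand-written rule in place of the learned dCTS policy: merge a 2x2 patch neighborhood into one supertoken when its patches are mutually similar. The similarity threshold and all names below are assumptions, not the paper's method.

```python
import numpy as np

def merge_superpatches(patches, sim_threshold=0.95):
    """Merge homogeneous 2x2 patch neighborhoods into single supertokens.

    patches: (H, W, D) grid of patch embeddings, H and W even.
    A 2x2 block becomes one averaged supertoken when all four patches
    have pairwise cosine similarity above sim_threshold (a hand-written
    stand-in for the learned dCTS policy decision).
    Returns a list of (embedding, num_patches_covered) supertokens.
    """
    H, W, D = patches.shape
    supertokens = []
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            block = patches[i:i + 2, j:j + 2].reshape(4, D)
            normed = block / np.linalg.norm(block, axis=1, keepdims=True)
            sim = normed @ normed.T
            if sim.min() > sim_threshold:            # homogeneous region
                supertokens.append((block.mean(axis=0), 4))
            else:                                    # heterogeneous: keep patches
                supertokens.extend((p, 1) for p in block)
    return supertokens
```

On uniform regions (sky, road) this collapses four tokens into one, while detailed regions keep full resolution, which is the intuition behind the reported 2.5x token reduction.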
Michal Szczepanski
Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France
Martyna Poreba
Karim Haroun
I3S, Université Côte d’Azur, CNRS, Sophia Antipolis, 06900, France