🤖 AI Summary
To address the excessive computational and memory overhead of Vision Transformers (ViTs) in high-resolution semantic segmentation, this paper proposes STEP, a framework that jointly optimizes efficiency and accuracy. STEP introduces a dynamic Clustering-based Token Supergrouping (dCTS) policy network to generate variable-sized superpatches, and integrates an early-exit mechanism into the ViT-Large encoder to dynamically prune high-confidence supertokens. Together, the lightweight CNN-driven patch merging and confidence-based pruning give fine-grained control over computational load. On 1024×1024 images, STEP reduces FLOPs by up to 4×, accelerates inference by 1.7×, incurs at most a 2.0% mIoU drop, and halts processing for up to 40% of tokens before the final encoder layer. Its core contribution is the first synergistic integration of dynamic supertoken construction and early exit into ViT-based semantic segmentation, enabling adaptive computation while preserving accuracy.
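The summary above does not give pseudocode, but the supertoken idea can be illustrated with a toy NumPy sketch. Here a hypothetical policy network has already produced a boolean `merge_mask` marking which 2×2 groups of patch embeddings are homogeneous enough to fuse into a single supertoken; all function and variable names are illustrative, not STEP's actual API, and real dCTS produces variable-sized (not only 2×2) superpatches.

```python
import numpy as np

def merge_superpatches(patches, merge_mask):
    """Toy supergrouping: `patches` is an (H, W, D) grid of patch embeddings;
    `merge_mask` is an (H//2, W//2) boolean grid (assumed output of a policy
    network) marking 2x2 groups to fuse into one supertoken by averaging."""
    H, W, D = patches.shape
    tokens = []
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            group = patches[i:i + 2, j:j + 2].reshape(4, D)
            if merge_mask[i // 2, j // 2]:
                tokens.append(group.mean(axis=0))  # one supertoken for the 2x2 group
            else:
                tokens.extend(group)               # keep four individual tokens
    return np.stack(tokens)

# Example: a 4x4 patch grid with two of four groups merged
rng = np.random.default_rng(0)
patches = rng.standard_normal((4, 4, 8))
merge_mask = np.array([[True, False], [False, True]])
tokens = merge_superpatches(patches, merge_mask)
# 2 merged groups contribute 1 token each, 2 unmerged contribute 4 each: 10 tokens
```

The token count drops from 16 to 10 in this example; the paper's reported 2.5× average reduction corresponds to the policy network merging much larger homogeneous regions.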
📝 Abstract
Vision Transformers (ViTs) achieve state-of-the-art performance in semantic segmentation but are hindered by high computational and memory costs. To address this, we propose STEP (SuperToken and Early-Pruning), a hybrid token-reduction framework that combines dynamic patch merging and token pruning to enhance efficiency without significantly compromising accuracy. At the core of STEP is dCTS, a lightweight CNN-based policy network that enables flexible merging of patches into superpatches. Encoder blocks also integrate early exits to remove high-confidence supertokens, lowering the computational load. We evaluate our method on high-resolution semantic segmentation benchmarks, including images up to 1024×1024, and show that when dCTS is applied alone, the token count can be reduced by a factor of 2.5 compared to the standard 16×16-pixel patching scheme. This yields a 2.6× reduction in computational cost and a 3.4× increase in throughput when using ViT-Large as the backbone. Applying the full STEP framework further improves efficiency, reaching up to a 4× reduction in computational complexity and a 1.7× gain in inference speed, with a maximum accuracy drop of no more than 2.0%. With the proposed STEP configurations, up to 40% of tokens can be confidently predicted and halted before reaching the final encoder layer.
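The early-exit mechanism described above can be sketched in a few lines: after each encoder block, an auxiliary head scores each still-active token, and tokens whose confidence exceeds a threshold are frozen and skip the remaining blocks. This is a minimal NumPy illustration under assumed details (the block functions, auxiliary head, and threshold are all hypothetical stand-ins, not the paper's architecture):

```python
import numpy as np

def early_exit_encoder(tokens, blocks, aux_head, threshold=0.9):
    """Toy early-exit loop: `tokens` is (N, D); `blocks` is a list of
    per-block token transforms; `aux_head` maps tokens to class probabilities.
    Tokens whose max probability exceeds `threshold` are halted early."""
    n = tokens.shape[0]
    active = np.ones(n, dtype=bool)            # tokens still being processed
    halted_at = np.full(n, len(blocks))        # block index at which each token exited
    for i, block in enumerate(blocks):
        tokens[active] = block(tokens[active])         # process only active tokens
        conf = aux_head(tokens[active]).max(axis=-1)   # per-token confidence
        newly_done = np.where(active)[0][conf > threshold]
        halted_at[newly_done] = i
        active[newly_done] = False
        if not active.any():                   # everything halted: stop early
            break
    return tokens, halted_at

# Example with dummy blocks and a softmax auxiliary head
rng = np.random.default_rng(1)
toks = rng.standard_normal((8, 16))
blocks = [lambda x: 0.9 * x for _ in range(4)]  # stand-ins for transformer blocks
softmax = lambda z: np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
aux = lambda x: softmax(x[:, :4])               # hypothetical lightweight classifier
out, halted = early_exit_encoder(toks, blocks, aux, threshold=0.5)
```

In the full model, halted supertokens keep their last representation and rejoin the others at the decoder, which is how STEP can skip up to 40% of tokens without a large mIoU penalty.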