🤖 AI Summary
To address two challenges in training-time sparsification of Vision Transformers, the non-differentiability of discrete token pruning and its inference overhead, this paper proposes a differentiable noise-injection paradigm: discrete pruning is relaxed into continuous additive noise injection governed by learnable noise-intensity parameters. Crucially, the paper establishes, for the first time, an intrinsic connection between token pruning and rate-distortion (R-D) theory, which motivates an R-D-inspired joint loss function. This formulation enables smooth end-to-end optimization during training while recovering zero-overhead discrete pruning at deployment. Evaluated on ImageNet, the method outperforms existing pruning approaches, delivering substantial gains in inference throughput at comparable accuracy, without requiring fine-tuning or architectural modifications.
📝 Abstract
In this work we present Training Noise Token (TNT) Pruning for vision transformers. Our method relaxes the discrete token-dropping decision into continuous additive noise, enabling smooth optimization during training while retaining the computational gains of discrete dropping at deployment. We draw theoretical connections to the Rate-Distortion literature and provide empirical evaluations on the ImageNet dataset with ViT and DeiT architectures, demonstrating TNT's advantages over previous pruning methods.
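The Rate-Distortion connection suggests a joint objective in which the task loss plays the role of distortion and the number of effectively retained tokens stands in for rate. The surrogate below is a hypothetical sketch of such a trade-off, not the paper's actual loss: the sigmoid "kept fraction" and the weight `lam` are assumptions made for illustration.

```python
import torch

def rd_style_loss(task_loss: torch.Tensor,
                  log_sigma: torch.Tensor,
                  lam: float = 0.1) -> torch.Tensor:
    """Rate-distortion-style joint objective (hypothetical surrogate).

    task_loss : distortion term (e.g. cross-entropy on ImageNet).
    log_sigma : per-token log noise intensities; low noise means the
                token is effectively kept, i.e. "bits spent".
    lam       : Lagrange-style weight trading rate against distortion.
    """
    # Soft fraction of tokens kept: sigmoid(-log_sigma) is near 1 for
    # low-noise (kept) tokens and near 0 for high-noise (pruned) ones.
    rate = torch.sigmoid(-log_sigma).mean()
    return task_loss + lam * rate
```

Minimizing this objective pushes noise intensities up (pruning tokens) wherever the task loss tolerates it, mirroring the classic R-D trade-off.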