PSViT: A Methodology for Structurally Pruning Spiking Vision Transformers

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of deploying Spiking Vision Transformer (SViT) models on resource-constrained embedded platforms, where their large size is a significant barrier. Existing pruning approaches for SViT are predominantly unstructured, so they need specialized hardware tailored to the resulting sparsity patterns to realize their efficiency benefits, which limits scalability. To overcome this, the study introduces PSViT, a structured pruning methodology for SViT that combines uniform channel-wise filter pruning, layer-wise sensitivity analysis, and fine-grained channel-wise pruning guided by that analysis and the network architecture. Because whole channels are removed, the pruned model can run efficiently on existing, widely used computing architectures without sparse-computation accelerators. On ImageNet-1K, single-shot pruning saves 22.4% of memory while reaching 70.3% accuracy without fine-tuning and 72.8% with fine-tuning, which is 0.5 percentage points below the original non-pruned model (73.3%).
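
The accuracy gaps quoted here are differences in percentage points, not relative percentages. As a small illustrative check using only the figures reported above, the sketch below computes both forms; the helper name `accuracy_drop` is hypothetical and not from the paper.

```python
from typing import Tuple


def accuracy_drop(baseline: float, pruned: float) -> Tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent).

    Both inputs are top-1 accuracies expressed in percent, e.g. 73.3.
    """
    absolute = baseline - pruned              # percentage points
    relative = 100.0 * absolute / baseline    # percent of the baseline accuracy
    return absolute, relative


if __name__ == "__main__":
    # Figures reported for ImageNet-1K: 73.3 (original), 72.8 (fine-tuned),
    # 70.3 (pruned without fine-tuning).
    for label, acc in [("with fine-tuning", 72.8), ("without fine-tuning", 70.3)]:
        points, percent = accuracy_drop(73.3, acc)
        print(f"{label}: {points:.1f} points, {percent:.2f}% relative")
```

The absolute form is the one that matches the reported gaps of 0.5 and 3.0; the relative drop is slightly larger in both cases because it is divided by a baseline below 100.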

📝 Abstract

Spiking Vision Transformer (SViT) models are promising low-power ViT models for solving vision-based tasks with state-of-the-art performance. However, their large size limits their deployment on resource-constrained embedded platforms, underscoring the need for model compression. One prominent compression technique is pruning, and state-of-the-art works employ unstructured pruning to compress SViT models. Such techniques require specialized hardware architectures tailored to the sparsity patterns to maximize their efficiency benefits, which makes this approach hard to scale. To address this, we propose PSViT, a novel methodology for performing structured pruning on SViT models, making it possible to efficiently accelerate their inference on existing and widely used computing architectures. PSViT employs three key steps: (1) uniform channel-wise filter pruning to structurally eliminate non-significant weights, (2) sensitivity analysis to evaluate the impact of channel-wise pruning of each individual layer on accuracy and network size, and (3) fine-grained channel-wise pruning based on the sensitivity analysis and the given network architecture. Experimental results show that PSViT obtains 22.4% memory saving through single-shot pruning while keeping accuracy within 3 percentage points (70.3% without fine-tuning and 72.8% with fine-tuning) of the original non-pruned SViT model (73.3%) on ImageNet-1K. These results also show that the PSViT methodology advances the effort to enable efficient SViT deployments in resource-constrained applications.
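
The three steps named in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how channel significance is measured, so the L1-norm criterion, the threshold rule in `assign_ratios`, the toy `total_l1_score` evaluator, and all function and layer names are hypothetical. Each layer is modeled as a list of output channels, each holding that channel's weights.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]  # one row per output channel (assumed layout)


def prune_channels(weight: Matrix, ratio: float) -> Matrix:
    """Remove the `ratio` fraction of channels with the smallest L1 norm.

    The L1 norm is an assumed significance criterion, chosen for illustration.
    """
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must be in [0, 1)")
    n_drop = min(int(len(weight) * ratio), len(weight) - 1)  # keep >= 1 channel
    norms = [sum(abs(w) for w in row) for row in weight]
    order = sorted(range(len(weight)), key=lambda i: norms[i])
    dropped = set(order[:n_drop])
    # Whole rows are removed, so the result is a smaller dense matrix.
    return [row for i, row in enumerate(weight) if i not in dropped]


def sensitivity_analysis(
    layers: Dict[str, Matrix],
    ratio: float,
    evaluate: Callable[[Dict[str, Matrix]], float],
) -> Dict[str, float]:
    """Prune one layer at a time and record the resulting score drop."""
    baseline = evaluate(layers)
    drops = {}
    for name in layers:
        trial = dict(layers)  # shallow copy; only `name` is replaced
        trial[name] = prune_channels(layers[name], ratio)
        drops[name] = baseline - evaluate(trial)
    return drops


def assign_ratios(
    drops: Dict[str, float], tolerance: float, high: float, low: float
) -> Dict[str, float]:
    """Hypothetical rule: prune robust layers harder than sensitive ones."""
    return {n: (high if d <= tolerance else low) for n, d in drops.items()}


def total_l1_score(layers: Dict[str, Matrix]) -> float:
    """Toy stand-in for validation accuracy: total retained weight magnitude."""
    return sum(abs(w) for m in layers.values() for row in m for w in row)


if __name__ == "__main__":
    toy = {
        "attn": [[1.0, 1.0], [0.1, 0.0], [2.0, 0.5], [0.0, 0.05]],
        "mlp": [[1.0], [1.0], [1.0], [1.0]],
    }
    drops = sensitivity_analysis(toy, ratio=0.5, evaluate=total_l1_score)
    ratios = assign_ratios(drops, tolerance=0.5, high=0.5, low=0.25)
    pruned = {n: prune_channels(w, ratios[n]) for n, w in toy.items()}
    print(ratios, {n: len(w) for n, w in pruned.items()})
```

Because entire channels are dropped, each pruned layer stays a dense matrix that standard hardware can process directly, which is the property that distinguishes structured from unstructured pruning. In a real network, removing an output channel also requires removing the matching input channel of the following layer, and `evaluate` would be a validation-accuracy measurement rather than a weight-magnitude proxy.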

Problem

Research questions and friction points this paper is trying to address.

Spiking Vision Transformer

model compression

structured pruning

resource-constrained deployment

hardware efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

structured pruning

Spiking Vision Transformer

channel-wise pruning