🤖 AI Summary
This work addresses high-accuracy pruning under extremely limited fine-tuning budgets. Using Marchenko–Pastur (MP) random matrix theory, the authors bound the effect of weight perturbations on output logits and derive a zero-budget perfect-pruning result along with a prune–restore extension. Layer-wise pruning budgets are set from the fitted edge of the MP spectral distribution, so the method avoids a long post-pruning retraining pipeline. It supports several structured sparsity patterns (e.g., 2:4, 6:12) and applies to both Vision Transformers and CNNs. On ImageNet-1k, a 2:4 + ToMe ViT-B/16 reaches 83.41% top-1 accuracy after only three distillation epochs, 1.70 percentage points below the dense baseline, while reducing sparse-execution MACs by 59.81% and giving a best-observed 1.388× speedup on an A40 GPU. Across the other reported ViT, ConvNeXtV2, and ResNet configurations, the stated top-1 drops range from 0.26 to 1.53 percentage points.
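To make the role of the MP edge concrete, the sketch below counts how many singular values of a weight matrix sit above the asymptotic edge $\sigma(\sqrt{p}+\sqrt{n})$ expected for a $p \times n$ matrix with iid zero-mean entries of standard deviation $\sigma$. It is an illustration under stated assumptions, not the authors' fitting procedure: estimating $\sigma$ from the entrywise standard deviation, the `slack` factor, and the function names `mp_singular_edge` and `count_spikes` are all hypothetical choices made here.

```python
import numpy as np


def mp_singular_edge(shape, sigma):
    """Asymptotic largest singular value of a p x n matrix with iid
    zero-mean entries of standard deviation sigma."""
    p, n = shape
    return sigma * (np.sqrt(p) + np.sqrt(n))


def count_spikes(weight, slack=1.05):
    """Count singular values above the MP edge (times a small slack).

    Assumption: sigma is estimated from the entrywise standard deviation,
    which is only reasonable when the matrix is noise-dominated.
    """
    sigma = weight.std()
    edge = mp_singular_edge(weight.shape, sigma)
    singular_values = np.linalg.svd(weight, compute_uv=False)
    n_spikes = int(np.sum(singular_values > slack * edge))
    return n_spikes, edge
```

Under this reading, a layer whose spectrum lies almost entirely inside the bulk looks random-like and might tolerate a larger pruning budget, while singular values above the edge would mark structure worth preserving. How the paper turns the edge into an actual per-layer budget is not specified in this summary.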
📝 Abstract
We study a Marchenko--Pastur (MP) random-matrix approach to pruning deep neural networks with very small post-pruning fine-tuning budgets. The main practical contribution is accuracy retention under short calibration and fine-tuning schedules, rather than a long post-pruning reoptimization pipeline. The theory gives deterministic data-path certificates: if the removed component $R$ has small propagated logit effect $L_s \| R \psi_1(s) \|_\infty$, pruning decreases an elastic-net objective and preserves samples whose dense margin exceeds twice the perturbation. The zero-budget case gives perfect pruning; a prune--restore extension models weight restoration inside a fixed sparse-execution pattern; and an additive $L_2$-regularized model shows admissible random-like components vanish at the training limit, with persistent spikes stabilizing as the MP bulk collapses. Under iid-Gaussian sufficient conditions, the fitted MP edge $\sigma_+$ gives a high-probability layerwise budget signal. On ImageNet-1k, after only three distillation epochs, ViT-B/16 $2{:}4{+}$ToMe reaches $83.41\%$ top-1 ($-1.70$ pp from dense) at $59.81\%$ sparse-execution MAC reduction, with $1.388\times$ best-observed A40 native-$2{:}4$ backend speedup for the same checkpoint and ToMe graph; a separate no-ToMe A100 endpoint gives $2.705\times$. At structured sparsity, ViT-B/16 $6{:}12$ reaches $83.74\%$, ViT-L/16 $8{:}16$ dense+permutation reaches $85.33\%$ ($-0.51$ pp), and ConvNeXtV2-Base $12{:}16$ reaches $86.35\%$ ($-0.37$ pp). For CNNs, ResNet50 $8{:}16$ dense+permutation reaches $75.87\%$ ($-0.26$ pp), and ResNet152d CAST-conv+permutation reaches $81.33\%$ ($-1.53$ pp) at ${\sim}50\%$ MAC accounting with a $1.62\times$ A40 im2col$+2{:}4$ sparse-GEMM audit.
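The margin condition in the abstract follows from a short argument: if pruning moves every logit by at most $\varepsilon$, the gap between the top logit and the runner-up shrinks by at most $2\varepsilon$, so the predicted class cannot change when the dense margin exceeds $2\varepsilon$. The sketch below checks that condition per sample. Here `eps` stands in for the propagated bound $L_s \| R \psi_1(s) \|_\infty$, whose computation is not described in this abstract, and the function name `certified_preserved` is a hypothetical one chosen for illustration.

```python
import numpy as np


def certified_preserved(dense_logits, eps):
    """Return a boolean array marking samples whose top-1 prediction is
    guaranteed unchanged if every logit moves by at most eps.

    dense_logits: array of shape (num_samples, num_classes).
    eps: assumed sup-norm bound on the logit perturbation.
    """
    # The two largest logits per sample: [..., 0] is the runner-up,
    # [..., 1] is the maximum.
    top2 = np.partition(dense_logits, -2, axis=-1)[..., -2:]
    margin = top2[..., 1] - top2[..., 0]
    return margin > 2.0 * eps
```

For example, with logits `[[3.0, 1.0, 0.5], [1.0, 0.9, 0.0]]` and `eps = 0.2`, the margins are 2.0 and 0.1, so only the first sample is certified. Uncertified samples are not necessarily misclassified after pruning; the check is sufficient, not necessary.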