🤖 AI Summary
Balancing accuracy and efficiency remains challenging in large language model (LLM) deployment. Method: This paper proposes an efficient, training-free, multi-dimensional pruning method tailored for Transformer architectures. Unlike conventional single-dimension block pruning, the approach jointly and iteratively prunes across three dimensions—residual blocks, MLP channels, and attention heads—with structural rebalancing to enable fine-grained compression. Contribution/Results: The proposed multidimensional pruning framework unifies cross-module sparsity without requiring labeled data or fine-tuning, improving zero-shot downstream task accuracy. Experiments on multiple state-of-the-art LLMs demonstrate that the method outperforms existing training-free pruning approaches, improving compression ratios while reducing computational cost and memory footprint, and maintaining—or even exceeding—the accuracy of prior pruning baselines.
📝 Abstract
Recently, state-of-the-art approaches for pruning large pre-trained models (LPMs) have demonstrated that the training-free removal of non-critical residual blocks in Transformers is a viable way to reduce model size, achieving results that outperform previous training-free pruning approaches. Motivated by these findings, we extend BlockPruner (Zhong et al., 2024) and propose MultiPruner, a pruning approach that surpasses recent training-free pruning methods by adopting a multidimensional, iterative, fine-grained pruning strategy. In MultiPruner, multidimensional pruning restores the structural balance of block-pruned models by sequentially compressing along three dimensions: i) residual blocks, ii) channels of multilayer perceptrons (MLP), and iii) attention heads. This solution enhances zero-shot accuracy on downstream tasks compared to other techniques while improving model compression ratios, producing compressed models with lower compute and memory requirements. Extensive experiments demonstrate the advantages of the proposed method across various large pre-trained models. The code and pruning configurations are available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.
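The iterative, multidimensional strategy described in the abstract — repeatedly choosing the least important unit to remove, considering all three dimensions at once — can be sketched as a simple greedy loop. This is a minimal illustration only: the names (`Candidate`, `importance`, the three dimension labels) and the greedy selection are assumptions for exposition, not the authors' actual algorithm or API.

```python
# Hypothetical sketch of a training-free, multidimensional greedy pruning loop.
# A real implementation would score candidates with a calibration set (e.g. the
# perplexity increase caused by removing a unit) and apply the prune to the model.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    dim: str     # one of: "residual_block", "mlp_channel_group", "attention_head"
    index: int   # which unit within that dimension
    cost: float  # compute/memory saved if this unit is pruned

def multidimensional_prune(
    candidates: List[Candidate],
    importance: Callable[[Candidate], float],  # lower = safer to remove (assumed proxy)
    budget: float,                             # total cost we want to remove
) -> List[Candidate]:
    """Greedily remove the least-important unit across all three dimensions,
    one unit per iteration, until the compression budget is met."""
    pruned: List[Candidate] = []
    remaining = list(candidates)
    saved = 0.0
    while remaining and saved < budget:
        # Jointly consider blocks, MLP channel groups, and heads at every step.
        best = min(remaining, key=importance)
        remaining.remove(best)
        pruned.append(best)
        saved += best.cost
    return pruned
```

Because all dimensions compete in every iteration, the loop can interleave coarse removals (whole residual blocks) with fine-grained ones (channel groups, heads), which is what lets the compressed model stay structurally balanced.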