Towards Extreme Pruning of LLMs with Plug-and-Play Mixed Sparsity

📅 2025-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing extreme pruning methods for large language models (LLMs) suffer from coarse-grained sparsity allocation, strong coupling between structural patterns and pruning strategies, and substantial performance degradation. Method: This paper proposes a layer-sensitivity-driven mixed-sparsity pruning framework. It introduces a layer-sensitivity quantification mechanism based on the trace of the Fisher Information Matrix, decoupling the pruning strategy from the model architecture, and designs a pruning-oriented evolutionary algorithm that automatically optimizes layer-wise N:M structured sparsity. The framework is plug-and-play and compatible with mainstream pruning pipelines. Contribution/Results: Applied to LLaMA and LLaMA-2 at extreme pruning ratios (e.g., 75%), the method reduces perplexity (PPL) by orders of magnitude compared to state-of-the-art methods on language modeling and zero-shot tasks, significantly alleviating the performance bottleneck of extreme pruning.

📝 Abstract
N:M structured pruning is essential for large language models (LLMs) because it removes less important network weights and reduces memory and computation requirements. Existing pruning methods mainly focus on designing metrics that measure the importance of network components to guide pruning. Beyond the impact of these metrics, we observe that different layers have different sensitivities with respect to network performance. Thus, we propose an efficient method based on the trace of the Fisher Information Matrix (FIM) to quantitatively measure and verify the different sensitivities across layers. Based on this, we propose Mixed Sparsity Pruning (MSP), which uses a pruning-oriented evolutionary algorithm (EA) to determine the optimal sparsity levels for different layers. To guarantee fast convergence and achieve promising performance, we use the efficient FIM-inspired layer-wise sensitivity to initialize the population of the EA. In addition, MSP can work as a plug-and-play module, ready to be integrated into existing pruning methods. Extensive experiments with LLaMA and LLaMA-2 on language modeling and zero-shot tasks demonstrate superior performance. In particular, at extreme pruning ratios (e.g., 75%), our method significantly outperforms existing methods in terms of perplexity (PPL) by orders of magnitude (Figure 1).
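As a concrete illustration of the N:M structured sparsity the abstract refers to: for every group of M consecutive weights, only the N largest-magnitude entries are kept. A minimal magnitude-based sketch (not the authors' code):

```python
import numpy as np

def nm_prune_mask(weights, n, m):
    """Binary mask keeping the n largest-magnitude weights in every
    contiguous group of m (N:M structured sparsity)."""
    flat = weights.reshape(-1, m)                  # groups of m weights
    # indices of the (m - n) smallest-magnitude weights in each group
    drop = np.argsort(np.abs(flat), axis=1)[:, : m - n]
    mask = np.ones_like(flat)
    np.put_along_axis(mask, drop, 0.0, axis=1)     # zero out the dropped weights
    return mask.reshape(weights.shape)

w = np.array([0.1, -2.0, 0.3, 1.5, -0.2, 0.05, 0.9, -1.1])
mask = nm_prune_mask(w, n=2, m=4)
print(mask)  # → [0. 1. 0. 1. 0. 0. 1. 1.]
```

Layer-wise mixed sparsity then amounts to choosing a different (n, m) pair per layer, which is what the evolutionary search optimizes.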
Problem

Research questions and friction points this paper is trying to address.

Develops N:M structured pruning for LLMs to reduce memory and computation.
Proposes Mixed Sparsity Pruning using evolutionary algorithm for optimal sparsity.
Enhances pruning efficiency with Fisher Information Matrix-based sensitivity analysis.
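The FIM-based sensitivity above has a simple diagonal approximation: the trace of the empirical Fisher reduces to the sum of squared loss gradients over a layer's weights. A minimal sketch, where toy gradient vectors stand in for real backprop values and the layer names are placeholders:

```python
import numpy as np

def fim_trace_sensitivity(layer_grads):
    """Per-layer sensitivity as the trace of the diagonal empirical
    Fisher Information Matrix: the sum of squared loss gradients.
    Layers with a larger trace are more sensitive to pruning."""
    return {name: float(np.sum(g ** 2)) for name, g in layer_grads.items()}

# toy gradients (in practice, obtained by backprop on calibration data)
grads = {"layer.0": np.array([0.5, -0.5]), "layer.1": np.array([2.0, 1.0])}
sens = fim_trace_sensitivity(grads)
print(sens)  # → {'layer.0': 0.5, 'layer.1': 5.0}
```

A layer with a higher trace would then be assigned a denser N:M pattern (more weights kept).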
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Fisher Information Matrix for layer sensitivity
Implements Mixed Sparsity Pruning with evolutionary algorithm
Plug-and-play module for existing pruning methods
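The evolutionary search over layer-wise sparsity can likewise be sketched as a simple (1+1)-style search over per-layer N values (for a fixed M), seeded by the FIM sensitivities. The fitness function and mutation scheme below are illustrative placeholders, not the paper's exact algorithm:

```python
import random

def evolve_sparsity(sensitivity, fitness, steps=200, seed=0):
    """(1+1)-style evolutionary search over layer-wise N:M levels.
    A candidate assigns one N (weights kept per group of M) to each
    layer; FIM sensitivities seed the start point so that more
    sensitive layers begin denser."""
    rng = random.Random(seed)
    num_layers = len(sensitivity)
    median = sorted(sensitivity)[num_layers // 2]
    cand = [2 if s >= median else 1 for s in sensitivity]  # sensitivity-informed init
    best, best_fit = cand, fitness(cand)
    for _ in range(steps):
        child = best[:]
        child[rng.randrange(num_layers)] = rng.choice([1, 2, 3])  # mutate one layer's N
        f = fitness(child)
        if f > best_fit:                                   # keep only improvements
            best, best_fit = child, f
    return best

# toy fitness: reward density on sensitive layers, penalize overall density
sens = [0.1, 0.9, 0.5, 0.2]
fit = lambda ns: sum(s * n for s, n in zip(sens, ns)) - 0.4 * sum(ns)
best = evolve_sparsity(sens, fit)
```

In the paper's setting the fitness would instead be a (negated) perplexity proxy evaluated on calibration data, and the search runs over a population rather than a single candidate.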
Chi Xu
School of Informatics, Xiamen University, China
Gefei Zhang
Zhejiang University of Technology
Education Visualization · Human-Computer Interaction
Yantong Zhu
School of Informatics, Xiamen University, China
Luca Benini
ETH Zürich, Università di Bologna
Integrated Circuits · Computer Architecture · Embedded Systems · VLSI · Machine Learning
Guosheng Hu
University of Bristol, England
Yawei Li
ETH Zurich, Switzerland
Zhihong Zhang
School of Informatics, Xiamen University, China