🤖 AI Summary
Existing layer-wise sparsity allocation strategies for large language model (LLM) pruning rely heavily on heuristics or expensive search, leading to suboptimal performance and limited interpretability.
Method: This paper proposes a redundancy-driven layer-wise sparsity allocation framework. It first reports three empirical findings: layer-wise pruning sensitivity is highly non-uniform, it depends on the pruning metric, and sparse-model performance is related to how uniform the layer-wise redundancy level is. From these findings it derives three principles for sparsity allocation: non-uniformity, pruning metric dependency, and a uniform layer-wise redundancy level in the pruned model. Building on these, it introduces the Maximum Redundancy Pruning (MRP) algorithm, which at each iteration prunes the most redundant layers (those with the highest non-outlier ratio), yielding an automatic and interpretable sparsity assignment grounded in the stated principles.
Results: Evaluated on LLaMA2 and OPT across multiple benchmarks, MRP achieves average accuracy gains of 1.8–3.2 points at equivalent sparsity levels, while significantly improving the uniformity of the redundancy distribution across layers. This supports both the effectiveness and the generalizability of the principle-driven design.
📝 Abstract
Large language models (LLMs) have demonstrated impressive capabilities, but their enormous size poses significant challenges for deployment in real-world applications. To address this issue, researchers have sought to apply network pruning techniques to LLMs. A critical challenge in pruning is allocating the sparsity for each layer. Recent sparsity allocation methods are often based on heuristics or search, which can easily lead to suboptimal performance. In this paper, we conduct an extensive investigation into various LLMs and report three significant findings: (1) the layerwise pruning sensitivity (LPS) of LLMs is highly non-uniform, (2) the choice of pruning metric affects LPS, and (3) the performance of a sparse model is related to the uniformity of its layerwise redundancy level. Based on these observations, we propose that the layerwise sparsity of LLMs should adhere to three principles: non-uniformity, pruning metric dependency, and a uniform layerwise redundancy level in the pruned model. To this end, we propose Maximum Redundancy Pruning (MRP), an iterative pruning algorithm that prunes the most redundant layers (i.e., those with the highest non-outlier ratio) at each iteration. The resulting layerwise sparsity aligns with the outlined principles. We conduct extensive experiments on publicly available LLMs, including LLaMA2 and OPT, across various benchmarks. Experimental results validate the effectiveness of MRP, demonstrating its superiority over previous methods.
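The iterative procedure described in the abstract can be illustrated with a small, hedged sketch. It is not the authors' implementation: the abstract does not define the non-outlier ratio, the pruning metric, or the step size, so everything below is an assumption made for illustration. In particular, the sketch assumes each layer is represented by a flat list of non-negative importance scores, that an outlier is a score larger than `threshold_mult` times the mean of the layer's remaining scores, and that each iteration removes a fixed fraction `step` of the selected layer's weights, lowest scores first. The names `non_outlier_ratio` and `mrp_allocate` are hypothetical.

```python
def non_outlier_ratio(scores, threshold_mult=5.0):
    """Fraction of remaining scores that are NOT outliers.

    Assumption (not from the paper): a score is an outlier when it
    exceeds threshold_mult times the mean of the remaining scores.
    Empty layers return -1.0 so they are never selected for pruning.
    """
    if not scores:
        return -1.0
    mean = sum(scores) / len(scores)
    kept = sum(1 for s in scores if s <= threshold_mult * mean)
    return kept / len(scores)


def mrp_allocate(layer_scores, target_sparsity, step=0.01, threshold_mult=5.0):
    """Toy layer-wise sparsity allocation in the spirit of MRP.

    layer_scores: list of lists, one list of importance scores per layer
                  (hypothetical stand-in for a real pruning metric).
    Returns the per-layer sparsity reached once the global budget is met.
    """
    # Sort ascending so the least important weights are pruned first.
    ordered = [sorted(scores) for scores in layer_scores]
    sizes = [len(scores) for scores in ordered]
    pruned = [0] * len(ordered)
    budget = round(target_sparsity * sum(sizes))

    while sum(pruned) < budget:
        # Redundancy of each layer, measured on its remaining weights only.
        ratios = [
            non_outlier_ratio(ordered[i][pruned[i]:], threshold_mult)
            for i in range(len(ordered))
        ]
        # Select the most redundant layer (highest non-outlier ratio).
        layer = max(range(len(ordered)), key=lambda i: ratios[i])
        remaining = sizes[layer] - pruned[layer]
        if remaining == 0:
            break  # nothing left to prune anywhere
        # Prune a small chunk, without overshooting the global budget.
        chunk = max(1, int(step * sizes[layer]))
        pruned[layer] += min(chunk, remaining, budget - sum(pruned))

    return [p / n if n else 0.0 for p, n in zip(pruned, sizes)]


if __name__ == "__main__":
    # Layer 0 has no outliers; layer 1 has 10% large-score outliers.
    layers = [[1.0] * 1000, [0.01] * 900 + [10.0] * 100]
    print(mrp_allocate(layers, target_sparsity=0.3))
```

On this toy input, the layer without outliers absorbs the pruning budget while the layer with outliers is left untouched, so the resulting allocation is non-uniform. In a real setting the scores would come from a pruning metric computed on the model, and because the abstract states that the metric affects layerwise sensitivity, the allocation would be expected to change with the metric chosen.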