Toward Adaptive Large Language Models Structured Pruning via Hybrid-grained Weight Importance Assessment

📅 2024-03-16
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing structured pruning methods for large language models (LLMs) rely predominantly on single-granularity importance estimation, either fine-grained (e.g., individual weights) or coarse-grained (e.g., structural blocks), which leads to substantial performance degradation and poor generalization after pruning. This work proposes an end-to-end adaptive hybrid-granularity pruning framework (HyWIA) that jointly models weight-level (fine-grained) and block-level (coarse-grained) importance and introduces an attention mechanism that dynamically fuses these complementary signals for adaptive granularity selection. When pruning LLaMA-7B to 50% sparsity, the method outperforms LLM-Pruner by an average of 2.82% in accuracy across seven downstream tasks. Validation on further architectures, including Vicuna, Baichuan, and BLOOM, demonstrates strong cross-model generalizability. Core contributions: (1) hybrid-granularity importance modeling, (2) attention-driven adaptive importance fusion, and (3) a scalable, end-to-end structured pruning paradigm.

📝 Abstract
Structured pruning for large language models (LLMs) has garnered significant academic interest due to its ability to efficiently compress and accelerate LLMs by eliminating redundant weight groups at a coarse-grained granularity. Current structured pruning methods for LLMs typically depend on a single granularity for assessing weight importance, resulting in notable performance degradation in downstream tasks. Intriguingly, our empirical investigations reveal that unstructured pruning, which achieves better performance retention by pruning weights at a finer granularity, i.e., individual weights, yields significantly different sparse LLM structures when compared with structured pruning. This suggests that both holistic and individual assessments of weight importance are essential for LLM pruning. Building on this insight, we introduce the Hybrid-grained Weight Importance Assessment (HyWIA), a novel method that merges fine-grained and coarse-grained evaluations of weight importance for the pruning of LLMs. Leveraging an attention mechanism, HyWIA adaptively determines the optimal blend of granularity in weight importance assessments in an end-to-end pruning manner. Extensive experiments on LLaMA-V1/V2, Vicuna, Baichuan, and BLOOM across various benchmarks demonstrate the effectiveness of HyWIA in pruning LLMs. For example, HyWIA surpasses the cutting-edge LLM-Pruner by an average margin of 2.82% in accuracy across seven downstream tasks when pruning LLaMA-7B by 50%.
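
The abstract describes HyWIA's central idea: fusing per-weight (fine-grained) and per-group (coarse-grained) importance scores through an attention-style weighting learned end-to-end. The sketch below illustrates that idea only and is not the authors' released code; the Taylor-style fine-grained score, the L2-norm coarse-grained score, and the `alpha_logits` mixing parameter are assumptions chosen to keep the example self-contained.

```python
import torch

def fused_group_importance(weight: torch.Tensor, grad: torch.Tensor,
                           alpha_logits: torch.Tensor) -> torch.Tensor:
    """Score each output channel (weight group) of a linear layer.

    weight, grad : (out_features, in_features) weight matrix and its gradient
    alpha_logits : (2,) learnable logits balancing the two granularities
    """
    # Fine-grained signal: first-order Taylor saliency |w * dL/dw| of every
    # individual weight, aggregated back to its group (row) by summation.
    fine = (weight * grad).abs().sum(dim=1)        # (out_features,)

    # Coarse-grained signal: score the whole row (channel) as a single unit,
    # here via its L2 norm.
    coarse = weight.norm(p=2, dim=1)               # (out_features,)

    # Attention-style fusion: a softmax over the two logits yields the
    # adaptive blend between granularities.
    alpha = torch.softmax(alpha_logits, dim=0)     # (2,)
    return alpha[0] * fine + alpha[1] * coarse

# Toy usage: rank channels by the fused score and keep the top 50%.
W = torch.randn(64, 128, requires_grad=True)
(W.sum() ** 2).backward()                          # stand-in for a calibration loss
scores = fused_group_importance(W.detach(), W.grad, torch.zeros(2))
kept_channels = scores.argsort(descending=True)[: W.shape[0] // 2]
```

In the method as described, the mixing coefficients would be trained jointly with the pruning objective rather than fixed, and group importance would be computed over structural units such as attention heads and MLP channels across the full model.
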
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Pruning Techniques
Performance Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

HyWIA
Structured Pruning
Large Language Model Optimization
Jun Liu
Northeastern University, Carnegie Mellon University
Zhenglun Kong
Harvard University
Efficient Deep Learning, Large Language Model, AI4Science
Pu Zhao
Northeastern University
Changdi Yang
PhD candidate, Northeastern University, Snap Inc.
Efficient Deep Learning
Hao Tang
Peking University
Xuan Shen
Cornell Tech, Northeastern University
Efficient Deep Learning, ML Systems, AutoML
Geng Yuan
University of Georgia
Efficient AI, Explainable AI, Trustworthy ML, Edge Computing, AI Applications
Wei Niu
University of Georgia
Wenbin Zhang
Florida International University
Xue Lin
Northeastern University
Electrical and Computer Engineering
Dong Huang
Carnegie Mellon University
Yanzhi Wang
Northeastern University