Olica: Efficient Structured Pruning of Large Language Models without Retraining

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing structured pruning methods for large language models (LLMs) rely heavily on data and computational resources for post-pruning retraining to restore model quality, incurring prohibitive costs. Method: We propose Olica, a structured pruning framework for LLMs that eliminates retraining. It treats the paired matrix products in multi-head attention (MHA) as unified entities and compresses them via principal component analysis (PCA); introduces a low-rank linear calibration to mitigate the error accumulation caused by feed-forward network (FFN) pruning; and recovers the calibration parameters by applying singular value decomposition to the solution of a least-squares problem, with no gradient-based training required. Contribution/Results: Olica achieves accuracy close to the full model across multiple benchmarks, outperforming existing training-free pruning approaches, while remaining efficient in data usage, GPU memory, and running time.

📝 Abstract
Most existing structured pruning methods for Large Language Models (LLMs) require substantial computational and data resources for retraining to reestablish the corrupted correlations, making them prohibitively expensive. To address this, we propose a pruning framework for LLMs called Orthogonal decomposition and Linear Calibration (Olica), which eliminates the need for retraining. A key observation is that the multi-head attention (MHA) layer depends on two types of matrix products. By treating these matrix products as unified entities and applying principal component analysis (PCA), we extract the most important information to compress LLMs without sacrificing accuracy or disrupting their original structure. Consequently, retraining becomes unnecessary. A fast decomposition method is devised, reducing the complexity of PCA by a factor of the square of the number of attention heads. Additionally, to mitigate the error accumulation problem caused by pruning the feed-forward network (FFN) layer, we introduce a linear calibration method to reconstruct the residual errors of pruned layers using low-rank matrices. By leveraging singular value decomposition (SVD) on the solution of the least-squares problem, these matrices are obtained without requiring retraining. Extensive experiments show that the proposed Olica is efficient in terms of data usage, GPU memory, and running time, while delivering superior performance across multiple benchmarks.
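To make the PCA idea in the abstract concrete, the toy sketch below treats the product of two attention projection matrices as a single entity and factorizes it with a truncated SVD (equivalently, keeping the top principal components). The dimensions, rank, and random matrices are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 16  # hidden size and target rank (illustrative values)

# Hypothetical projection matrices; the key idea is that only the
# fused product W_Q @ W_K.T matters inside attention, so it can be
# factorized as a single entity.
W_Q = rng.standard_normal((d, d))
W_K = rng.standard_normal((d, d))
M = W_Q @ W_K.T

# Truncated SVD of the fused product keeps the top-r components.
U, S, Vt = np.linalg.svd(M)
W_Q_new = U[:, :r] * np.sqrt(S[:r])     # compressed query projection, (d, r)
W_K_new = Vt[:r, :].T * np.sqrt(S[:r])  # compressed key projection, (d, r)

# The two smaller matrices approximate the fused product, so the
# attention scores are preserved up to the discarded components.
approx = W_Q_new @ W_K_new.T
err = np.linalg.norm(M - approx) / np.linalg.norm(M)
print(err)
```

By the Eckart-Young theorem this rank-r factorization is optimal in Frobenius norm, which is why the compressed projections can replace the originals without retraining.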
Problem

Research questions and friction points this paper is trying to address.

Eliminates retraining need for structured pruning in LLMs
Compresses LLMs via PCA without accuracy loss
Reduces error accumulation in pruned FFN layers
Innovation

Methods, ideas, or system contributions that make the work stand out.

PCA-based compression without retraining LLMs
Fast decomposition reduces PCA complexity
Linear calibration with SVD for error mitigation
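The linear-calibration idea above can be sketched on synthetic data: fit the residual between the original and pruned layer outputs by least squares, then truncate the solution with SVD to obtain two small calibration matrices. All sizes, the rank, and the toy "pruned layer" are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, r = 256, 32, 4  # calibration samples, layer width, calibration rank

X = rng.standard_normal((n, d))           # calibration inputs to the layer
Y_full = X @ rng.standard_normal((d, d))  # original layer outputs (toy stand-in)
L = rng.standard_normal((d, 3)) @ rng.standard_normal((3, d))
Y_pruned = Y_full - X @ L                 # pruned layer leaves a low-rank residual

# Least-squares fit of the residual error introduced by pruning.
B, *_ = np.linalg.lstsq(X, Y_full - Y_pruned, rcond=None)

# SVD truncation splits the correction into two low-rank matrices,
# which act as a cheap linear calibration appended to the pruned layer.
U, S, Vt = np.linalg.svd(B, full_matrices=False)
A1 = U[:, :r] * S[:r]  # (d, r)
A2 = Vt[:r, :]         # (r, d)

calibrated = Y_pruned + X @ (A1 @ A2)
err_before = np.linalg.norm(Y_full - Y_pruned)
err_after = np.linalg.norm(Y_full - calibrated)
print(err_after < err_before)
```

Because both steps are closed-form (one least-squares solve, one SVD), the calibration matrices are obtained without any gradient-based retraining, matching the abstract's description.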
Jiujun He
Center of Statistical Research, School of Statistics and Data Science, and New Cornerstone Science Laboratory, Southwestern University of Finance and Economics, Chengdu, China
Huazhen Lin
Southwestern University of Finance and Economics
Nonparametric Method · Functional Data Analysis