SPQ: An Ensemble Technique for Large Language Model Compression

📅 2026-02-20
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the difficulty of deploying large language models in resource-constrained environments, where their high memory footprint and low inference efficiency are the main obstacles. The authors propose SPQ, a unified compression framework that combines variance-preserving singular value decomposition (SVD), activation-based pruning of MLP layers, and post-training 8-bit linear quantization in a layer-aware pipeline. Evaluated on LLaMA-2-7B, SPQ reduces memory usage by up to 75% while improving WikiText-2 perplexity from 5.47 to 4.91. It also delivers up to 1.9× higher inference throughput than GPTQ and preserves accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K.
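As a rough illustration of the post-training 8-bit linear quantization step mentioned above, the sketch below quantizes a weight matrix to int8 and reconstructs it. The symmetric, per-row scaling scheme and the names `quantize_int8` and `dequantize_int8` are assumptions made for this example; the summary does not specify the exact scheme SPQ uses.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-row linear quantization to int8 (assumed scheme, not necessarily SPQ's)."""
    # One scale per row: the largest magnitude in the row maps to 127.
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.rint(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Map int8 codes back to approximate float32 weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)  # toy stand-in for a linear layer

q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

max_err = float(np.abs(w - w_hat).max())  # rounding error, at most about scale / 2
bytes_fp32 = w.nbytes                     # original float32 storage
bytes_int8 = q.nbytes + scale.nbytes      # int8 codes plus one float32 scale per row
print(max_err, bytes_int8 / bytes_fp32)
```

The storage ratio here is relative to a float32 toy matrix only; it says nothing about the baseline precision behind the memory figures reported for LLaMA-2-7B.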

📝 Abstract
This study presents an ensemble technique, SPQ (SVD-Pruning-Quantization), for large language model (LLM) compression that combines variance-retained singular value decomposition (SVD), activation-based pruning, and post-training linear quantization. Each component targets a different source of inefficiency: i) pruning removes redundant neurons in MLP layers, ii) SVD reduces attention projections into compact low-rank factors, and iii) 8-bit quantization uniformly compresses all linear layers. At matched compression ratios, SPQ outperforms individual methods (SVD-only, pruning-only, or quantization-only) in perplexity, demonstrating the benefit of combining complementary techniques. Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K. Compared to strong baselines like GPTQ and SparseGPT, SPQ offers competitive perplexity and accuracy while using less memory (6.86 GB vs. 7.16 GB for GPTQ). Moreover, SPQ improves inference throughput over GPTQ, achieving up to a 1.9x speedup, which further enhances its practicality for real-world deployment. By combining layer-aware, complementary compression techniques, SPQ may enable practical deployment of LLMs in memory-constrained environments. Code is available at: https://github.com/JiaminYao/SPQ_LLM_Compression/
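The first two components named in the abstract can be sketched on toy matrices as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `svd_compress`, `prune_mlp`, the 0.99 variance threshold, the 0.5 keep ratio, the mean-absolute-activation score, and the plain two-layer ReLU MLP (LLaMA's MLP is gated) are all hypothetical choices made for the example.

```python
import numpy as np

def svd_compress(w: np.ndarray, variance_kept: float = 0.9):
    """Factor w ~= a @ b with the smallest rank whose squared singular
    values reach `variance_kept` of the total (assumed selection rule)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    rank = min(int(np.searchsorted(energy, variance_kept)) + 1, len(s))
    a = u[:, :rank] * s[:rank]  # shape (out, rank), singular values folded in
    b = vt[:rank, :]            # shape (rank, in)
    return a, b

def prune_mlp(w_up, w_down, activations, keep_ratio=0.5):
    """Keep the hidden neurons with the largest mean |activation|.
    activations: (samples, hidden); w_up: (hidden, d_in); w_down: (d_out, hidden)."""
    score = np.abs(activations).mean(axis=0)
    n_keep = max(1, int(round(keep_ratio * score.size)))
    keep = np.sort(np.argsort(score)[-n_keep:])
    return w_up[keep, :], w_down[:, keep]

rng = np.random.default_rng(0)

# SVD demo: a nearly rank-4 matrix standing in for an attention projection.
w = rng.standard_normal((32, 4)) @ rng.standard_normal((4, 48))
w += 0.01 * rng.standard_normal(w.shape)
a, b = svd_compress(w, variance_kept=0.99)
rel_err = float(np.linalg.norm(w - a @ b) / np.linalg.norm(w))

# Pruning demo: half of the hidden neurons are dead (all-zero input weights),
# so removing them should leave the MLP output unchanged.
x = rng.standard_normal((64, 16))
w_up = rng.standard_normal((8, 16))
w_up[[1, 3, 4, 6]] = 0.0
w_down = rng.standard_normal((8, 8))
h = np.maximum(x @ w_up.T, 0.0)  # hidden activations on calibration inputs
out_full = h @ w_down.T
w_up_p, w_down_p = prune_mlp(w_up, w_down, h, keep_ratio=0.5)
out_pruned = np.maximum(x @ w_up_p.T, 0.0) @ w_down_p.T

print(a.shape, b.shape, rel_err, np.allclose(out_full, out_pruned))
```

Storing the two factors costs rank × (out + in) values instead of out × in, which is where the low-rank saving comes from. In a real model the pruned neurons are rarely exactly dead, so pruning trades some accuracy for size.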
Problem

Research questions and friction points this paper is trying to address.

large language model compression
memory efficiency
inference speedup
model deployment
resource-constrained environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

SVD
Pruning
Quantization
LLM Compression
Ensemble Compression