Krony-PT: GPT2 compressed with Kronecker Products

📅 2024-12-16
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses parameter redundancy in the MLP layers of GPT-2 with a targeted compression method based on Kronecker products. To counter the convergence instability and reduced expressivity of low-rank Kronecker factorization, two innovations are proposed: (1) a modified Van Loan decomposition that gives the Kronecker factors a more stable initialization for training, and (2) a pruning-guided initialization heuristic that preserves model capacity after compression. The approach compresses the 124M-parameter GPT-2 to models of 80M to 96M parameters; the 81M variant surpasses DistilGPT-2 on standard language modeling benchmarks and matches or outperforms prior Kronecker-compressed GPT-2 variants with larger parameter counts (above 96M), demonstrating *parameter reduction without performance degradation* and *smaller models outperforming larger ones*. This is a notable step toward efficient transformer architecture design.
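The core idea can be sketched in a few lines: a dense MLP weight matrix W is replaced by a Kronecker product A ⊗ B of two much smaller factors. The factor shapes below are illustrative (chosen so the product matches GPT-2 small's 768 → 3072 up-projection), not necessarily the ones used in the paper.

```python
import numpy as np

# GPT-2 small MLP up-projection: 768 -> 3072 (shapes for illustration only)
d_in, d_out = 768, 3072

# Dense weight would need d_out * d_in parameters
dense_params = d_out * d_in  # 2,359,296

# Kronecker factorization W ~ A (x) B with illustrative factor shapes:
# (1536 x 768) (x) (2 x 1) has the same shape as the dense weight
A = np.random.randn(d_out // 2, d_in)
B = np.random.randn(2, 1)
W = np.kron(A, B)  # shape (3072, 768)

kron_params = A.size + B.size  # ~2x fewer parameters than the dense weight
print(W.shape, dense_params, kron_params)
```

Different factor shapes trade compression ratio against expressivity, which is how the paper obtains variants ranging from 80M to 96M total parameters.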

📝 Abstract
We introduce Krony-PT, a compression technique for GPT2 (Radford et al., 2019) based on Kronecker products. We specifically target the MLP layers of each transformer layer, and systematically compress the feed-forward layer matrices to various degrees. We introduce a modified Van Loan decomposition to initialize the new factors, and also introduce a new pruning-based initialization trick. Our method compresses the original 124M-parameter GPT2 to various smaller models, with 80M being the smallest and 96M the largest compressed model. Our 81M model variant outperforms DistilGPT2 on next-token prediction on all standard language modeling datasets, and shows competitive scores with, or performs on par with, other Kronecker-product-based compressed GPT2 models that are significantly larger in size.
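The Van Loan decomposition mentioned in the abstract refers to the Van Loan–Pitsianis construction: the nearest Kronecker product A ⊗ B to a given matrix W (in Frobenius norm) is obtained from the leading singular vectors of a block rearrangement of W. A minimal sketch of the standard construction (the paper's *modified* variant adds further changes not reproduced here):

```python
import numpy as np

def nearest_kronecker(W, shape_A, shape_B):
    """Van Loan-Pitsianis: best rank-1 Kronecker approximation W ~ A (x) B."""
    m1, n1 = shape_A
    m2, n2 = shape_B
    assert W.shape == (m1 * m2, n1 * n2)
    # Rearrange W so each row of R is one (m2 x n2) block of W, flattened;
    # then W = A (x) B exactly iff R = vec(A) vec(B)^T (rank one).
    R = W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(s[0]) * U[:, 0].reshape(m1, n1)
    B = np.sqrt(s[0]) * Vt[0].reshape(m2, n2)
    return A, B

# Sanity check: an exact Kronecker product is recovered exactly
A0 = np.random.randn(4, 3)
B0 = np.random.randn(2, 5)
W = np.kron(A0, B0)
A, B = nearest_kronecker(W, (4, 3), (2, 5))
print(np.allclose(np.kron(A, B), W))  # True
```

Using this as initialization means the compressed factors start from the best Frobenius-norm approximation of the pretrained dense weight, rather than from random values.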
Problem

Research questions and friction points this paper is trying to address.

Compressing GPT-2 model parameters using Kronecker products
Reducing feed-forward layer sizes in transformer blocks
Achieving competitive performance with smaller compressed models
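The focus on feed-forward matrices is easy to motivate with rough parameter accounting (approximate figures, using the standard GPT-2 small configuration; embeddings and attention make up the rest):

```python
# Rough parameter accounting for GPT-2 small (~124M total), to see why the
# MLP (feed-forward) matrices are the natural compression target.
n_layers, d_model, d_ff = 12, 768, 3072

mlp_per_layer = 2 * d_model * d_ff    # up- and down-projection weights
mlp_total = n_layers * mlp_per_layer  # ~56.6M
print(mlp_total, round(mlp_total / 124e6, 2))  # roughly 46% of the model
```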
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compresses GPT-2 using Kronecker products
Applies modified Van Loan decomposition for initialization
Introduces pruning-based initialization for model compression
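One way a pruning-based initialization can interact with a Kronecker factorization is to pick one factor as a 0/1 selector, so that A ⊗ B reproduces a row-pruned copy of the dense weight at initialization. This is an illustrative sketch of that general idea; the paper's exact heuristic may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))  # stand-in for a dense MLP weight

# Selector factor: keep even rows, zero the odd ones
B = np.array([[1.0], [0.0]])
A = W[::2, :]                    # surviving rows become the large factor

W_init = np.kron(A, B)           # same shape as W
print(np.allclose(W_init[::2], W[::2]))   # True: kept rows match W
print(np.allclose(W_init[1::2], 0.0))     # True: pruned rows are zero
```

The factors then train from a starting point that already matches half of the pretrained weight exactly, instead of approximating all of it at lower rank.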
Ben Ayad Mohamed Ayoub
University of Passau, Germany
Jelena Mitrović
University of Passau, Germany
Natural Language Processing · Artificial Intelligence · Computational Rhetoric · Legal NLP
Michael Granitzer
University of Passau, Germany