🤖 AI Summary
This work addresses parameter redundancy in the MLP layers of GPT-2 via a targeted Kronecker-product compression method. To tackle convergence instability and degraded expressivity in low-rank Kronecker factorization, we propose two key innovations: (1) an enhanced Van Loan decomposition initialization strategy that improves the training stability of the Kronecker factors; and (2) a pruning-guided heuristic initialization that preserves model capacity after compression. Our approach compresses the 124M-parameter GPT-2 to 80–96M parameters—with the best variant at just 81M—while surpassing DistilGPT-2 (117M) across standard language modeling benchmarks. Moreover, it outperforms existing Kronecker-compressed variants with larger parameter counts (e.g., >96M), demonstrating both *parameter reduction without performance degradation* and *smaller models outperforming larger ones*. This represents a significant advance in efficient transformer architecture design.
📝 Abstract
We introduce Krony-PT, a compression technique for GPT-2 \citep{radford2019language} based on Kronecker products. We specifically target the MLP layers of each transformer block and systematically compress the feed-forward matrices to various degrees. We introduce a modified Van Loan decomposition to initialize the new factors, along with a new pruning-based initialization trick. Our method compresses the original 124M-parameter GPT-2 into a range of smaller models, with 80M being the smallest and 96M the largest. Our 81M variant outperforms DistilGPT-2 on next-token prediction across all standard language modeling datasets, and performs on par with or better than other Kronecker-product-based compressed GPT-2 models that are significantly larger.
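The Van Loan decomposition used for initialization finds the nearest Kronecker product W ≈ A ⊗ B by rearranging W and taking a rank-1 SVD. A minimal NumPy sketch of this classical construction (the function name and shapes are illustrative, not the paper's code; the paper's *modified* variant is not reproduced here):

```python
import numpy as np

def nearest_kronecker(W, m1, n1, m2, n2):
    """Nearest Kronecker product (Van Loan & Pitsianis): W ~= A (x) B,
    where W is (m1*m2, n1*n2), A is (m1, n1), B is (m2, n2).

    W is rearranged so each (m2 x n2) block becomes one row; the best
    rank-1 approximation of that rearrangement yields the two factors.
    """
    # Split W into an (m1 x n1) grid of (m2 x n2) blocks, vectorize each block.
    R = W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(s[0]) * U[:, 0].reshape(m1, n1)
    B = np.sqrt(s[0]) * Vt[0].reshape(m2, n2)
    return A, B

# Sanity check: exact recovery when W is itself a Kronecker product.
rng = np.random.default_rng(0)
A0, B0 = rng.standard_normal((4, 3)), rng.standard_normal((5, 2))
W = np.kron(A0, B0)
A, B = nearest_kronecker(W, 4, 3, 5, 2)
print(np.allclose(np.kron(A, B), W))  # True
```

For a general weight matrix the rank-1 truncation gives the Frobenius-optimal single-Kronecker-factor initialization, which the compressed factors then refine during training.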