🤖 AI Summary
This work addresses parameter redundancy in the MLP layers of GPT-2 via a targeted Kronecker-product compression method. To tackle convergence instability and degraded expressivity in low-rank Kronecker factorization, we propose two key innovations: (1) an enhanced Van Loan decomposition initialization strategy that improves the training stability of the Kronecker factors; and (2) a pruning-guided heuristic initialization that preserves model capacity after compression. Our approach compresses the 124M-parameter GPT-2 to 80–96M parameters—with the best variant at just 81M—while surpassing DistilGPT-2 (117M) across standard language modeling benchmarks. Moreover, it outperforms existing Kronecker-compressed variants with larger parameter counts (e.g., >96M), demonstrating both *parameter reduction without performance degradation* and *smaller models outperforming larger ones*. This represents a significant advance in efficient transformer architecture design.
📝 Abstract
We introduce Krony-PT, a compression technique for GPT-2 \citep{radford2019language} based on Kronecker products. We specifically target the MLP layers of each transformer block and systematically compress the feed-forward matrices to various degrees. We introduce a modified Van Loan decomposition to initialize the new factors, along with a new pruning-based initialization trick. Our method compresses the original 124M-parameter GPT-2 into a range of smaller models, with 80M being the smallest and 96M the largest. Our 81M variant outperforms DistilGPT-2 on next-token prediction across all standard language modeling datasets, and performs on par with or better than other Kronecker-product-based compressed GPT-2 models that are significantly larger.
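The Van Loan decomposition used for initialization finds the nearest Kronecker product W ≈ A ⊗ B by rearranging W and taking a rank-1 SVD. A minimal NumPy sketch of this classical construction (the function name and shapes are illustrative, not the paper's code; the paper's *modified* variant is not reproduced here):

```python
import numpy as np

def nearest_kronecker(W, m1, n1, m2, n2):
    """Nearest Kronecker product (Van Loan & Pitsianis): W ~= A (x) B,
    where W is (m1*m2, n1*n2), A is (m1, n1), B is (m2, n2).

    W is rearranged so each (m2 x n2) block becomes one row; the best
    rank-1 approximation of that rearrangement yields the two factors.
    """
    # Split W into an (m1 x n1) grid of (m2 x n2) blocks, vectorize each block.
    R = W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(s[0]) * U[:, 0].reshape(m1, n1)
    B = np.sqrt(s[0]) * Vt[0].reshape(m2, n2)
    return A, B

# Sanity check: exact recovery when W is itself a Kronecker product.
rng = np.random.default_rng(0)
A0, B0 = rng.standard_normal((4, 3)), rng.standard_normal((5, 2))
W = np.kron(A0, B0)
A, B = nearest_kronecker(W, 4, 3, 5, 2)
print(np.allclose(np.kron(A, B), W))  # True
```

For a general weight matrix the rank-1 truncation gives the Frobenius-optimal single-Kronecker-factor initialization, which the compressed factors then refine during training.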