🤖 AI Summary
This work investigates the mechanistic impact of structured compression on the downstream task performance of pre-trained large language models (LLMs). To this end, we conduct over 1,000 cross-scale experiments on models ranging from 0.5B to 14B parameters. We establish an LLM compression law: generative loss scales quadratically with the compression ratio, whereas downstream task performance degrades approximately linearly. Building on this insight, we show that recovery fine-tuning, a lightweight adaptation step, reduces the test loss of highly compressed models (up to 90% compression) by as much as 55%. Empirical results show that 90% compression yields up to a 60% inference speedup; however, models smaller than 7B exhibit a speedup ceiling of about 35%. Our study provides both empirical grounding and practical guidelines for performance-efficiency trade-offs, enabling deployable compression strategies in resource-constrained environments.
📝 Abstract
We introduce compression laws for large language models (LLMs). While recent scaling laws have sought to understand how LLMs scale with respect to model size, pre-training data, and computational resources, we focus on understanding how model compression affects the performance of a pre-trained LLM on downstream tasks. We empirically examine the effects of structured model compression on LLMs through over $1000$ experiments across eight models with sizes ranging from $0.5B$ to $14B$ parameters. Our findings indicate that the test cross-entropy loss increases quadratically with the compression ratio, whereas performance on downstream tasks declines only linearly. Our study emphasizes the importance of recovery fine-tuning in mitigating generation loss, showing that the test loss of compressed LLMs can improve by up to 55% with recovery fine-tuning. At higher compression ratios (up to 90%), compressed LLMs demonstrate a 60% speedup during inference compared to their uncompressed counterparts, compensating for the performance degradation at this level. However, for smaller models ($\le 7B$), the computational gains are limited, peaking at just 35%. We conclude that model compression can be highly beneficial for larger models, especially when a smaller model within the same computational budget is not available. These insights provide practical guidelines for applying model compression techniques when deploying LLMs in real-life, resource-constrained applications.
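As a rough illustration of the stated trends (the abstract does not give an explicit formula), the compression law can be sketched as a quadratic fit for the test loss and a linear fit for downstream accuracy in the compression ratio $\rho \in [0, 1]$, where $\mathcal{L}_0$ and $A_0$ denote the uncompressed baselines and $a$, $b$ are hypothetical fitted constants:

$$\mathcal{L}(\rho) \;\approx\; \mathcal{L}_0 + a\,\rho^2, \qquad A(\rho) \;\approx\; A_0 - b\,\rho .$$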