EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search

📅 2024-10-18
🏛️ arXiv.org
📈 Citations: 5
Influential: 2
📄 PDF
🤖 AI Summary
Existing dynamic non-uniform compression methods for large language models (LLMs) suffer from inter-layer error dependence and the breakdown of conventional monotonicity assumptions, rendering layer-wise error decomposition infeasible and undermining theoretical guarantees. Method: This paper proposes an evolutionary search framework for dynamic LLM compression with provable convergence, jointly optimizing layer- and block-wise structured pruning, unstructured sparsity, and dynamic bit-width quantization under a global compression constraint. Contribution/Results: By explicitly modeling non-independent layer errors and dropping the restrictive monotonicity assumption, the framework identifies compression configurations that minimize accuracy degradation. It establishes new state-of-the-art results on the Llama, Mistral, and Phi model families across diverse compression paradigms, achieving superior accuracy–compression trade-offs compared to prior approaches.

📝 Abstract
The high computational costs of large language models (LLMs) have led to a flurry of research on LLM compression, via methods such as quantization, sparsification, or structured pruning. A new frontier in this area is given by *dynamic, non-uniform* compression methods, which adjust the compression levels (e.g., sparsity) per-block or even per-layer in order to minimize accuracy loss, while guaranteeing a global compression threshold. Yet, current methods rely on heuristics for identifying the "importance" of a given layer towards the loss, based on assumptions such as *error monotonicity*, i.e. that the end-to-end model compression error is proportional to the sum of layer-wise errors. In this paper, we revisit this area, and propose a new and general approach for dynamic compression that is provably optimal in a given input range. We begin from the motivating observation that, in general, *error monotonicity does not hold for LLMs*: compressed models with lower sum of per-layer errors can perform *worse* than models with higher error sums. To address this, we propose a new general evolutionary framework for dynamic LLM compression called EvoPress, which has provable convergence, and low sample and evaluation complexity. We show that these theoretical guarantees lead to highly competitive practical performance for dynamic compression of Llama, Mistral and Phi models. Via EvoPress, we set new state-of-the-art results across all compression approaches: structural pruning (block/layer dropping), unstructured sparsity, as well as quantization with dynamic bitwidths. Our code is available at https://github.com/IST-DASLab/EvoPress.
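The search described in the abstract can be illustrated with a minimal sketch: a candidate is a per-layer compression profile, mutation swaps compression levels between two layers so the global budget is preserved exactly, and fitness comes from end-to-end evaluation rather than a sum of per-layer errors. Everything below is hypothetical (the level set, layer count, and `toy_loss` are stand-ins, not EvoPress's actual operators or evaluation); the toy loss includes a cross-layer interaction term to mimic the non-monotonicity the paper observes.

```python
import random

LEVELS = [0.0, 0.25, 0.5, 0.75]  # hypothetical per-layer sparsity options
NUM_LAYERS = 8
TARGET_SUM = 4.0                 # global budget: average sparsity of 0.5

def toy_loss(profile):
    # Hypothetical stand-in for evaluating the compressed model end-to-end.
    # The interaction term makes per-layer errors non-independent, so
    # minimizing the sum of layer-wise errors is not the same as
    # minimizing this end-to-end loss.
    base = sum(s ** 2 * (i + 1) for i, s in enumerate(profile))
    interaction = sum(profile[i] * profile[i + 1]
                      for i in range(len(profile) - 1))
    return base + 0.5 * interaction

def mutate(profile):
    # Level-switch mutation: raise sparsity in one layer and lower it in
    # another by the same step, so the global budget is preserved exactly.
    p = list(profile)
    step = LEVELS[1] - LEVELS[0]
    for _ in range(100):  # retry until a feasible swap is found
        i, j = random.sample(range(len(p)), 2)
        if p[i] + step <= LEVELS[-1] and p[j] - step >= LEVELS[0]:
            p[i] += step
            p[j] -= step
            return p
    return p

def evolve(generations=200, offspring=8, seed=0):
    # Simple (1 + offspring) evolutionary loop: keep the best profile seen.
    random.seed(seed)
    best = [0.5] * NUM_LAYERS  # uniform start that satisfies the budget
    best_loss = toy_loss(best)
    for _ in range(generations):
        for cand in (mutate(best) for _ in range(offspring)):
            loss = toy_loss(cand)
            if loss < best_loss:
                best, best_loss = cand, loss
    return best, best_loss
```

Because every mutation is budget-preserving, the global compression constraint holds by construction at every generation, and the search only ever compares candidates by end-to-end fitness, sidestepping the error-monotonicity assumption entirely.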
Problem

Research questions and friction points this paper is trying to address.

Dynamic non-uniform compression for large language models
Overcoming the layer-independence assumption in model compression
Optimizing compression profiles via an evolutionary search framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evolutionary search for dynamic model compression
Optimizes compression profiles across diverse models
Achieves state-of-the-art performance in LLM compression