🤖 AI Summary
To address the high memory footprint of large language models (LLMs) under deployment constraints, this paper proposes DeltaLLM, an efficient post-training compression method. Its core idea is the combination of inter-layer weight sharing with low-rank difference matrices, inspired by SVD and LoRA, complemented by a progressive module replacement training strategy. DeltaLLM requires only 30M–40M tokens of lightweight training to reach performance on par with models of comparable size trained from scratch. Instantiated as DeltaLlama and DeltaPhi, it reduces parameter count by 12% while retaining 90% of the base models' zero-shot accuracy. Notably, DeltaPhi 2.9B (24% reduction) matches recovery fine-tuned SlicedPhi 3.3B (12% reduction) in average zero-shot accuracy without any fine-tuning, despite being roughly 400M parameters smaller. The method strikes a favorable trade-off among compression ratio, accuracy retention, and training cost, making it a practical option for deploying LLMs under resource-constrained conditions.
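The core idea above (a shared base weight per group of layers plus a cheap low-rank "delta" per layer) can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation; the dimensions and rank below are made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # hidden size and delta rank (illustrative values, not from the paper)

# One shared base weight stands in for the weights of several consecutive blocks.
W_shared = rng.standard_normal((d, d)) / np.sqrt(d)

def block_weight(A, B):
    """Effective weight of one block: shared base plus its low-rank difference (LoRA-style)."""
    return W_shared + A @ B

# Each extra block only stores its low-rank factors A (d x r) and B (r x d).
A1, B1 = rng.standard_normal((d, r)), rng.standard_normal((r, d))

# Parameter cost per additional shared block drops from d*d to 2*d*r.
full_params = d * d
delta_params = 2 * d * r
print(delta_params / full_params)  # 0.125, i.e. 8x fewer parameters per shared block
```

Only the low-rank factors need to be trained, which is why a small token budget can suffice for recovery.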
📝 Abstract
We introduce DeltaLLM, a new post-training compression technique to reduce the memory footprint of LLMs. We propose an alternative way of structuring LLMs, with weight sharing between layers in subsequent Transformer blocks and additional low-rank difference matrices between them. For training, we adopt the progressive module replacement method and show that lightweight training of the low-rank modules on approximately 30M–40M tokens is sufficient to achieve performance on par with LLMs of comparable size trained from scratch. We release the resulting models, DeltaLLAMA and DeltaPHI, with a 12% parameter reduction, retaining 90% of the performance of the base Llama and Phi models on common knowledge and reasoning benchmarks. Our method also outperforms the compression techniques JointDrop, LaCo, ShortGPT, and SliceGPT at the same number of parameters removed. For example, DeltaPhi 2.9B with a 24% reduction achieves average zero-shot accuracies similar to recovery fine-tuned SlicedPhi 3.3B with a 12% reduction, despite being approximately 400M parameters smaller and having no fine-tuning applied. This work provides new insights into LLM architecture design and compression methods when storage space is critical.
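The progressive module replacement scheme mentioned above (in the spirit of BERT-of-Theseus-style training) can be sketched as follows: during training, each original module is stochastically swapped for its compressed counterpart with a probability that ramps up over time. This is a hedged toy sketch with made-up schedule and modules, not the paper's exact recipe.

```python
import random

def replace_prob(step, total_steps):
    """Probability of using the compressed module; ramps linearly from 0 to 1."""
    return min(1.0, step / total_steps)

def forward(x, originals, compressed, step, total_steps, rng):
    """Run x through the stack, stochastically swapping each module for its compressed version."""
    for orig, comp in zip(originals, compressed):
        x = comp(x) if rng.random() < replace_prob(step, total_steps) else orig(x)
    return x

# Toy "modules": simple functions on a float standing in for Transformer blocks.
originals  = [lambda v: v + 1.0 for _ in range(4)]
compressed = [lambda v: v + 1.0 for _ in range(4)]

rng = random.Random(0)
# Once step >= total_steps, the replacement probability is 1, so only
# compressed modules run and the network is fully swapped over.
out = forward(0.0, originals, compressed, step=100, total_steps=100, rng=rng)
print(out)  # 4.0
```

Ramping the probability lets the compressed modules learn in-context alongside the original ones before fully taking over, which is what keeps the required training budget small.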