🤖 AI Summary
To address the high memory footprint of large language models (LLMs) under deployment constraints, this paper proposes DeltaLLM, an efficient post-training compression method. Its core idea is the combination of inter-layer weight sharing with low-rank difference matrices, inspired by SVD and LoRA, complemented by a progressive module replacement training strategy. DeltaLLM requires only 30M–40M tokens of lightweight training to reach performance on par with models of comparable size trained from scratch. Instantiated as DeltaLlama and DeltaPhi, it reduces parameter count by 12% while retaining 90% of the base models' zero-shot accuracy. Notably, DeltaPhi 2.9B (24% reduction) matches recovery fine-tuned SlicedPhi 3.3B (12% reduction) in average zero-shot accuracy without any fine-tuning, despite being roughly 400M parameters smaller. The method strikes a favorable trade-off among compression ratio, accuracy retention, and training cost, making it a practical option for deploying LLMs under resource-constrained conditions.
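The core idea above (a shared base weight per group of layers plus a cheap low-rank "delta" per layer) can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation; the dimensions and rank below are made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # hidden size and delta rank (illustrative values, not from the paper)

# One shared base weight stands in for the weights of several consecutive blocks.
W_shared = rng.standard_normal((d, d)) / np.sqrt(d)

def block_weight(A, B):
    """Effective weight of one block: shared base plus its low-rank difference (LoRA-style)."""
    return W_shared + A @ B

# Each extra block only stores its low-rank factors A (d x r) and B (r x d).
A1, B1 = rng.standard_normal((d, r)), rng.standard_normal((r, d))

# Parameter cost per additional shared block drops from d*d to 2*d*r.
full_params = d * d
delta_params = 2 * d * r
print(delta_params / full_params)  # 0.125, i.e. 8x fewer parameters per shared block
```

Only the low-rank factors need to be trained, which is why a small token budget can suffice for recovery.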
📝 Abstract
We introduce DeltaLLM, a new post-training compression technique to reduce the memory footprint of LLMs. We propose an alternative way of structuring LLMs, with weight sharing between layers in subsequent Transformer blocks and additional low-rank difference matrices between them. For training, we adopt the progressive module replacement method and show that lightweight training of the low-rank modules on approximately 30M–40M tokens is sufficient to achieve performance on par with LLMs of comparable size trained from scratch. We release the resulting models, DeltaLLAMA and DeltaPHI, with a 12% parameter reduction, retaining 90% of the performance of the base Llama and Phi models on common knowledge and reasoning benchmarks. Our method also outperforms the compression techniques JointDrop, LaCo, ShortGPT, and SliceGPT at the same number of parameters removed. For example, DeltaPhi 2.9B with a 24% reduction achieves average zero-shot accuracies similar to recovery fine-tuned SlicedPhi 3.3B with a 12% reduction, despite being approximately 400M parameters smaller and having no fine-tuning applied. This work provides new insights into LLM architecture design and compression methods when storage space is critical.
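The progressive module replacement scheme mentioned above (in the spirit of BERT-of-Theseus-style training) can be sketched as follows: during training, each original module is stochastically swapped for its compressed counterpart with a probability that ramps up over time. This is a hedged toy sketch with made-up schedule and modules, not the paper's exact recipe.

```python
import random

def replace_prob(step, total_steps):
    """Probability of using the compressed module; ramps linearly from 0 to 1."""
    return min(1.0, step / total_steps)

def forward(x, originals, compressed, step, total_steps, rng):
    """Run x through the stack, stochastically swapping each module for its compressed version."""
    for orig, comp in zip(originals, compressed):
        x = comp(x) if rng.random() < replace_prob(step, total_steps) else orig(x)
    return x

# Toy "modules": simple functions on a float standing in for Transformer blocks.
originals  = [lambda v: v + 1.0 for _ in range(4)]
compressed = [lambda v: v + 1.0 for _ in range(4)]

rng = random.Random(0)
# Once step >= total_steps, the replacement probability is 1, so only
# compressed modules run and the network is fully swapped over.
out = forward(0.0, originals, compressed, step=100, total_steps=100, rng=rng)
print(out)  # 4.0
```

Ramping the probability lets the compressed modules learn in-context alongside the original ones before fully taking over, which is what keeps the required training budget small.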