🤖 AI Summary
To address the high memory overhead and low throughput of concurrently serving multiple fully fine-tuned large language models (LLMs), this paper proposes a system-level co-design centered on aggressive compression of fine-tuning deltas. It introduces a coding scheme that combines quantization and sparsification, exploiting the low magnitude and sparsity of delta parameters, and integrates it with memory-aware scheduling and incremental loading at runtime. The approach achieves up to 10× compression of delta parameters while preserving generation quality. Compared to state-of-the-art LLM serving systems, it delivers 2–12× higher throughput, enabling low-latency, high-concurrency service under bursty workloads. The framework provides a scalable solution for efficiently co-locating dozens of fine-tuned models on shared infrastructure.
📝 Abstract
Fine-tuning large language models (LLMs) greatly improves model quality for downstream tasks. However, serving many fine-tuned LLMs concurrently is challenging due to the sporadic, bursty, and varying request patterns of different models. To address this challenge, we present DeltaZip, an LLM serving system that efficiently serves multiple full-parameter fine-tuned models concurrently by aggressively compressing model deltas by up to 10× while maintaining high model quality. The key insight behind this design is that fine-tuning results in small-magnitude changes to the pre-trained model. By co-designing the serving system with the compression algorithm, DeltaZip achieves a 2× to 12× throughput improvement over state-of-the-art systems.
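The core idea, compressing the delta between a fine-tuned model and its base rather than the fine-tuned weights themselves, can be illustrated with a minimal sketch. The snippet below is a hypothetical toy codec (not DeltaZip's actual algorithm): it prunes the smallest-magnitude entries of the delta, then uniformly quantizes the survivors to a few bits, exploiting exactly the sparsity and small-magnitude properties the abstract describes.

```python
import numpy as np

def compress_delta(w_base, w_ft, bits=4, sparsity=0.5):
    """Toy delta codec (illustrative only, not DeltaZip's codec):
    sparsify the fine-tuning delta, then quantize kept values."""
    delta = w_ft - w_base
    # Sparsify: keep only the top-(1 - sparsity) fraction by magnitude.
    k = int(delta.size * (1 - sparsity))
    thresh = np.partition(np.abs(delta).ravel(), -k)[-k]
    mask = np.abs(delta) >= thresh
    sparse = delta * mask
    # Symmetric uniform quantization of the kept values to `bits` bits.
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(sparse).max() / levels
    q = np.round(sparse / scale).astype(np.int8)  # fits in int8 for bits <= 8
    return q, scale, mask

def decompress_delta(w_base, q, scale, mask):
    """Reconstruct an approximate fine-tuned weight matrix from the base."""
    return w_base + q.astype(np.float32) * scale * mask
```

Because the deltas are small in magnitude, the quantization scale is tiny and the reconstruction error stays well below the weights themselves; a real system would additionally entropy-code the quantized values and store only the base model plus one compressed delta per fine-tuned variant.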