🤖 AI Summary
To address the high memory overhead and low throughput of concurrently serving multiple fully fine-tuned large language models (LLMs), this paper proposes a system-level co-design centered on aggressive compression of fine-tuning deltas. It introduces a coding scheme that combines quantization and sparsification, exploiting the low magnitude and sparsity of delta parameters, and integrates it with memory-aware scheduling and incremental loading at runtime. The approach achieves up to 10× compression of delta parameters while preserving generation quality. Compared to state-of-the-art LLM serving systems, it delivers 2–12× higher throughput, enabling low-latency, high-concurrency service under bursty workloads. The framework provides a scalable solution for efficiently co-locating dozens of fine-tuned models on shared infrastructure.
📝 Abstract
Fine-tuning large language models (LLMs) greatly improves model quality for downstream tasks. However, serving many fine-tuned LLMs concurrently is challenging due to the sporadic, bursty, and varying request patterns of different models. To address this challenge, we present DeltaZip, an LLM serving system that efficiently serves multiple full-parameter fine-tuned models concurrently by aggressively compressing model deltas by up to 10× while maintaining high model quality. The key insight behind this design is that fine-tuning results in small-magnitude changes to the pre-trained model. By co-designing the serving system with the compression algorithm, DeltaZip achieves a 2× to 12× throughput improvement over state-of-the-art systems.
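The core idea, compressing the delta between a fine-tuned model and its base rather than the fine-tuned weights themselves, can be illustrated with a minimal sketch. The snippet below is a hypothetical toy codec (not DeltaZip's actual algorithm): it prunes the smallest-magnitude entries of the delta, then uniformly quantizes the survivors to a few bits, exploiting exactly the sparsity and small-magnitude properties the abstract describes.

```python
import numpy as np

def compress_delta(w_base, w_ft, bits=4, sparsity=0.5):
    """Toy delta codec (illustrative only, not DeltaZip's codec):
    sparsify the fine-tuning delta, then quantize kept values."""
    delta = w_ft - w_base
    # Sparsify: keep only the top-(1 - sparsity) fraction by magnitude.
    k = int(delta.size * (1 - sparsity))
    thresh = np.partition(np.abs(delta).ravel(), -k)[-k]
    mask = np.abs(delta) >= thresh
    sparse = delta * mask
    # Symmetric uniform quantization of the kept values to `bits` bits.
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(sparse).max() / levels
    q = np.round(sparse / scale).astype(np.int8)  # fits in int8 for bits <= 8
    return q, scale, mask

def decompress_delta(w_base, q, scale, mask):
    """Reconstruct an approximate fine-tuned weight matrix from the base."""
    return w_base + q.astype(np.float32) * scale * mask
```

Because the deltas are small in magnitude, the quantization scale is tiny and the reconstruction error stays well below the weights themselves; a real system would additionally entropy-code the quantized values and store only the base model plus one compressed delta per fine-tuned variant.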