Tangram: Accelerating Serverless LLM Loading through GPU Memory Reuse and Affinity

📅 2025-12-01
📈 Citations: 0
Influential: 0

🤖 AI Summary
Serverless LLM deployment suffers from GPU resource contention, causing severe cold-start latency—particularly during model loading—with delays scaling linearly with model size. To address this, we propose an efficient GPU memory reuse framework tailored for serverless environments: (1) a unified GPU memory pool enabling tensor-level cross-model parameter sharing; (2) an on-demand KV cache allocation policy; and (3) a GPU-affinity-aware scheduling algorithm for fine-grained, dynamic memory management. This is the first approach to deeply integrate fine-grained GPU memory reuse with affinity-aware scheduling. Evaluations against state-of-the-art baselines show up to 6.2× faster model loading and 23–55% reduction in time-to-first-token (TTFT), significantly alleviating cold-start bottlenecks for large-scale LLMs on serverless platforms.
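The second component above, on-demand KV cache allocation, can be sketched as follows. This is an illustrative sketch only, not Tangram's actual design: the block granularity and the names `KVCache`, `block_tokens`, and `append` are assumptions. The idea is to allocate cache blocks lazily as a sequence grows, rather than reserving max-context memory up front, leaving more GPU memory free to retain parameters of idle models.

```python
class KVCache:
    """Illustrative on-demand KV cache: blocks are allocated lazily as the
    sequence grows instead of reserving max-context memory at request start.
    Block size and all names are assumptions for illustration."""

    def __init__(self, block_tokens: int = 16):
        self.block_tokens = block_tokens
        self.blocks: list = []  # each block holds up to block_tokens entries

    def append(self, kv_entry) -> None:
        # Allocate a new block only when the current one is full.
        if not self.blocks or len(self.blocks[-1]) == self.block_tokens:
            self.blocks.append([])
        self.blocks[-1].append(kv_entry)


cache = KVCache(block_tokens=4)
for t in range(10):
    cache.append((f"k{t}", f"v{t}"))
# 10 tokens at 4 tokens/block -> 3 blocks allocated, none reserved up front
assert len(cache.blocks) == 3
```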

📝 Abstract
Serverless Large Language Models (LLMs) have emerged as a cost-effective solution for deploying AI services by enabling a 'pay-as-you-go' pricing model through GPU resource sharing. However, cold-start latency, especially the model loading phase, has become a critical performance bottleneck: it scales linearly with model size and severely limits the practical deployment of large-scale LLM services. This paper presents Tangram, a novel system that accelerates serverless LLM loading through efficient GPU memory reuse. By leveraging unused GPU memory to retain model parameters, Tangram significantly reduces model transfer time and cold-start latency. Its design includes three key components: a unified GPU memory pool for tensor-level parameter sharing across models, on-demand KV cache allocation for dynamic memory management, and GPU-affinity-aware scheduling for maximizing resource utilization. Together, these techniques address the critical challenges of inefficient memory usage and the cold-start problem on serverless LLM platforms. We have implemented a fully functional prototype, and experiments show that Tangram achieves up to 6.2× faster loading and reduces Time-To-First-Token (TTFT) during cold starts by 23--55% over state-of-the-art methods.
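The unified memory pool described in the abstract, with tensor-level parameter sharing across models, can be sketched roughly as below. This is a hypothetical illustration, not Tangram's implementation: the class name `TensorPool`, content-hash deduplication, and reference counting are all assumptions about how such a pool might work.

```python
import hashlib
from typing import Dict

class TensorPool:
    """Illustrative unified memory pool: identical parameter tensors (e.g.,
    weights shared between model variants) are stored once and reference-
    counted, so loading a new model reuses resident tensors instead of
    transferring them to the GPU again. All names are assumptions."""

    def __init__(self):
        self._pool: Dict[str, bytes] = {}  # content hash -> tensor bytes
        self._refs: Dict[str, int] = {}    # content hash -> reference count

    def put(self, tensor_bytes: bytes) -> str:
        key = hashlib.sha256(tensor_bytes).hexdigest()
        if key not in self._pool:          # cold path: actual transfer
            self._pool[key] = tensor_bytes
            self._refs[key] = 0
        self._refs[key] += 1               # warm path: just bump the refcount
        return key

    def get(self, key: str) -> bytes:
        return self._pool[key]

    def release(self, key: str) -> None:
        self._refs[key] -= 1
        if self._refs[key] == 0:           # evictable once no model uses it
            del self._pool[key]
            del self._refs[key]


# Two "models" loading the same weight tensor: only one copy is resident.
pool = TensorPool()
k1 = pool.put(b"layer0-weights")
k2 = pool.put(b"layer0-weights")
assert k1 == k2 and len(pool._pool) == 1
```

A real pool would manage device memory handles (e.g., CUDA allocations) rather than byte strings, but the dedup-and-refcount structure is the same.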
Problem

Research questions and friction points this paper is trying to address.

Accelerating serverless LLM loading to reduce cold-start latency.
Addressing inefficient GPU memory usage through parameter reuse.
Improving resource utilization with affinity-aware scheduling and dynamic memory management.
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPU memory reuse for parameter retention
Unified memory pool for tensor-level sharing
GPU-affinity-aware scheduling for resource optimization
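The affinity-aware scheduling idea above can be sketched as a placement policy that prefers the GPU already holding the largest share of a model's tensors, minimizing the bytes that must be loaded on a cold start. The function name `pick_gpu`, the scoring rule, and the tie-break by free memory are assumptions for illustration, not Tangram's actual algorithm.

```python
from typing import Dict, Set

def pick_gpu(model_tensors: Set[str],
             resident: Dict[int, Set[str]],
             free_mem: Dict[int, int],
             model_size: int) -> int:
    """Illustrative affinity-aware placement: among GPUs with enough free
    memory, prefer the one that already holds the most of this model's
    tensors; break ties by free memory. Names and scoring are assumptions."""
    candidates = [g for g in resident if free_mem[g] >= model_size]
    if not candidates:
        raise RuntimeError("no GPU with enough free memory")
    return max(candidates,
               key=lambda g: (len(model_tensors & resident[g]), free_mem[g]))


# GPU 0 already holds 2 of the model's 3 tensors, so it wins despite
# GPU 1 having more free memory.
resident = {0: {"t1", "t2"}, 1: {"t3"}}
free_mem = {0: 8, 1: 16}
assert pick_gpu({"t1", "t2", "t4"}, resident, free_mem, model_size=4) == 0
```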