🤖 AI Summary
To address high cost and SLO violations in serverless deep learning inference caused by coarse-grained, static GPU resource allocation, this paper proposes the first hybrid elastic scheduling architecture supporting SM-level fine-grained vertical scaling and cold-start optimization. Our approach tackles three key challenges: (1) runtime, arbitrary-granularity partitioning of GPU Streaming Multiprocessors (SMs) with dynamic temporal quota assignment; (2) a resource-aware performance predictor (RaPP) to model uncertainty in the vast fine-grained configuration space; and (3) an adaptive hybrid scaling algorithm jointly optimizing horizontal and vertical scaling. Evaluation shows that, compared to mainstream serverless platforms, our method reduces average function cost by 10.8× while significantly improving SLO compliance. Against state-of-the-art spatiotemporal sharing frameworks, it reduces SLO violations by 4.8× and lowers cost by 1.72×.
📝 Abstract
Serverless computing (FaaS) has become a popular paradigm for deep learning inference due to its ease of deployment and pay-per-use pricing. However, current serverless inference platforms suffer from coarse-grained, static GPU resource allocation during scaling, which leads to high costs and Service Level Objective (SLO) violations under fluctuating workloads. Moreover, because current platforms support only horizontal scaling for GPU inference, the cold-start problem further exacerbates these issues. In this paper, we propose HAS-GPU, an efficient Hybrid Auto-scaling Serverless architecture with fine-grained GPU allocation for deep learning inference. HAS-GPU provides an agile scheduler that allocates GPU Streaming Multiprocessor (SM) partitions and time quotas at arbitrary granularity and enables significant vertical quota scaling at runtime. To resolve the performance uncertainty introduced by the massive fine-grained resource configuration space, we propose the Resource-aware Performance Predictor (RaPP). Furthermore, we present an adaptive hybrid auto-scaling algorithm that combines horizontal and vertical scaling to guarantee inference SLOs and minimize GPU costs. Experiments demonstrate that, compared to a mainstream serverless inference platform, HAS-GPU reduces function costs by an average of 10.8× with better SLO guarantees. Compared to a state-of-the-art spatio-temporal GPU sharing serverless framework, HAS-GPU reduces function SLO violations by 4.8× and costs by 1.72× on average.
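The core intuition behind hybrid auto-scaling — grow warm instances vertically (adjusting their SM and time quotas, which avoids cold starts) and fall back to horizontal replica creation only when vertical headroom is exhausted — can be sketched as follows. This is an illustrative approximation only, not the paper's actual algorithm; the `Instance` class, `scale` function, and the linear capacity model are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    sm_quota: float    # fraction of the GPU's SMs assigned (0..1), hypothetical
    time_quota: float  # fraction of GPU time slices assigned (0..1), hypothetical

def scale(instances, demand, capacity_per_unit, max_quota=1.0):
    """Hypothetical hybrid scaler: first grow warm instances vertically
    (no cold start), then add replicas horizontally once quotas are full.
    Assumes throughput scales linearly with sm_quota * time_quota."""
    actions = []
    supply = sum(i.sm_quota * i.time_quota for i in instances) * capacity_per_unit

    # Vertical phase: raise SM quotas of existing (warm) instances.
    for inst in instances:
        if supply >= demand:
            break
        headroom = (max_quota - inst.sm_quota) * inst.time_quota * capacity_per_unit
        grow = min(headroom, demand - supply)
        if grow > 0:
            inst.sm_quota += grow / (inst.time_quota * capacity_per_unit)
            supply += grow
            actions.append(("vertical", inst))

    # Horizontal phase: cold-start new replicas for any remaining deficit.
    while supply < demand:
        new = Instance(sm_quota=min(max_quota,
                                    (demand - supply) / capacity_per_unit),
                       time_quota=1.0)
        instances.append(new)
        supply += new.sm_quota * new.time_quota * capacity_per_unit
        actions.append(("horizontal", new))
    return actions
```

For example, with one warm instance at half SM quota and demand for three times its current throughput, the scaler first doubles that instance's quota, then launches one additional replica for the remainder. The real system would replace the linear capacity model with RaPP's learned predictions when choosing configurations.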