🤖 AI Summary
To address high cost and SLO violations in serverless deep learning inference caused by coarse-grained, static GPU resource allocation, this paper proposes the first hybrid elastic scheduling architecture supporting SM-level fine-grained vertical scaling and cold-start optimization. Our approach tackles three key challenges: (1) runtime, arbitrary-granularity partitioning of GPU Streaming Multiprocessors (SMs) with dynamic temporal quota assignment; (2) a resource-aware performance predictor (RaPP) to model uncertainty in the vast fine-grained configuration space; and (3) an adaptive hybrid scaling algorithm jointly optimizing horizontal and vertical scaling. Evaluation shows that, compared to mainstream serverless platforms, our method reduces average function cost by 10.8× while significantly improving SLO compliance. Against state-of-the-art spatiotemporal sharing frameworks, it reduces SLO violations by 4.8× and lowers cost by 1.72×.
📝 Abstract
Serverless computing (FaaS) has become a popular paradigm for deep learning inference due to its ease of deployment and pay-per-use pricing. However, current serverless inference platforms suffer from coarse-grained, static GPU resource allocation during scaling, which leads to high costs and Service Level Objective (SLO) violations under fluctuating workloads. Moreover, because current platforms support only horizontal scaling for GPU inference, the cold-start problem further exacerbates these issues. In this paper, we propose HAS-GPU, an efficient Hybrid Auto-scaling Serverless architecture with fine-grained GPU allocation for deep learning inference. HAS-GPU provides an agile scheduler that allocates GPU Streaming Multiprocessor (SM) partitions and time quotas at arbitrary granularity and enables significant vertical quota scaling at runtime. To resolve the performance uncertainty introduced by the massive fine-grained resource configuration space, we propose the Resource-aware Performance Predictor (RaPP). Furthermore, we present an adaptive hybrid auto-scaling algorithm that combines horizontal and vertical scaling to guarantee inference SLOs and minimize GPU costs. Experiments demonstrate that, compared to a mainstream serverless inference platform, HAS-GPU reduces function costs by an average of 10.8× with better SLO guarantees. Compared to a state-of-the-art spatio-temporal GPU sharing serverless framework, HAS-GPU reduces function SLO violations by 4.8× and costs by 1.72× on average.
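The core intuition behind hybrid auto-scaling — grow warm instances vertically (adjusting their SM and time quotas, which avoids cold starts) and fall back to horizontal replica creation only when vertical headroom is exhausted — can be sketched as follows. This is an illustrative approximation only, not the paper's actual algorithm; the `Instance` class, `scale` function, and the linear capacity model are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    sm_quota: float    # fraction of the GPU's SMs assigned (0..1), hypothetical
    time_quota: float  # fraction of GPU time slices assigned (0..1), hypothetical

def scale(instances, demand, capacity_per_unit, max_quota=1.0):
    """Hypothetical hybrid scaler: first grow warm instances vertically
    (no cold start), then add replicas horizontally once quotas are full.
    Assumes throughput scales linearly with sm_quota * time_quota."""
    actions = []
    supply = sum(i.sm_quota * i.time_quota for i in instances) * capacity_per_unit

    # Vertical phase: raise SM quotas of existing (warm) instances.
    for inst in instances:
        if supply >= demand:
            break
        headroom = (max_quota - inst.sm_quota) * inst.time_quota * capacity_per_unit
        grow = min(headroom, demand - supply)
        if grow > 0:
            inst.sm_quota += grow / (inst.time_quota * capacity_per_unit)
            supply += grow
            actions.append(("vertical", inst))

    # Horizontal phase: cold-start new replicas for any remaining deficit.
    while supply < demand:
        new = Instance(sm_quota=min(max_quota,
                                    (demand - supply) / capacity_per_unit),
                       time_quota=1.0)
        instances.append(new)
        supply += new.sm_quota * new.time_quota * capacity_per_unit
        actions.append(("horizontal", new))
    return actions
```

For example, with one warm instance at half SM quota and demand for three times its current throughput, the scaler first doubles that instance's quota, then launches one additional replica for the remainder. The real system would replace the linear capacity model with RaPP's learned predictions when choosing configurations.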