DeepFlow: Serverless Large Language Model Serving at Scale

📅 2025-01-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address inefficient resource allocation, high inference latency, and slow cold starts in large language model (LLM) cloud services, this paper introduces DeepFlow, the first serverless AI platform tailored for Ascend NPU clusters. Methodologically, DeepFlow features: (1) a three-tier abstraction model (request-job-task); (2) FlowServe, a microkernel-style, NPU-native serving engine; (3) novel collaborative scheduling strategies for both PD-disaggregated and PD-colocated configurations; and (4) integrated optimizations including DRAM preloading, NPU-fork, and pre-warmed pods. Deployed on large-scale Ascend clusters, DeepFlow has sustained industrial-grade operation for over a year, supporting fine-tuning, agent execution, and model-serving APIs. It achieves second-level elastic scaling up to 64 instances, reduces end-to-end latency by 42%, and cuts resource overhead by 37%.

📝 Abstract
This paper introduces DeepFlow, a scalable serverless AI platform designed to serve large language models (LLMs) efficiently at scale in cloud environments. DeepFlow addresses key challenges such as resource allocation, serving efficiency, and cold-start latency through four main design components. First, it uses a simple serverless abstraction, the request-job-task model, which helps manage AI workloads across post-training and model-serving tasks. Second, it builds FlowServe, an in-house serving engine, using a microkernel-inspired design, NPU-centric execution, and SPMD-based parallelism to optimize LLM serving. The system also includes novel scheduling policies tailored for both PD-disaggregated and PD-colocated configurations. With optimizations such as pre-warmed pods, DRAM pre-loading, and NPU-fork, DeepFlow can scale up to 64 instances in seconds. DeepFlow has been in production for over a year, operating on a large Ascend NPU cluster and providing industry-standard APIs for fine-tuning, agent serving, and model serving to our customers.
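The abstract's request-job-task abstraction can be pictured as a simple three-level hierarchy. The sketch below is purely illustrative: the class names, fields, and the prefill/decode layout are assumptions for explanation, not DeepFlow's actual API.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of a request-job-task hierarchy: a user request fans
# out into jobs (units of parallel execution), each made of tasks that are
# the smallest schedulable units placed on NPUs.

@dataclass
class Task:
    """Smallest schedulable unit, e.g. one model-parallel shard on an NPU."""
    task_id: str
    npu_slot: int

@dataclass
class Job:
    """A unit of parallel execution, e.g. one prefill or decode instance."""
    job_id: str
    tasks: List[Task] = field(default_factory=list)

@dataclass
class Request:
    """A user-facing call (inference, fine-tuning, or an agent step)."""
    request_id: str
    jobs: List[Job] = field(default_factory=list)

# One inference request mapped onto a PD-disaggregated layout, where
# prefill and decode run as separate jobs on separate NPU slots.
req = Request("r1", jobs=[
    Job("prefill", tasks=[Task("p0", npu_slot=0), Task("p1", npu_slot=1)]),
    Job("decode",  tasks=[Task("d0", npu_slot=2)]),
])
assert sum(len(j.tasks) for j in req.jobs) == 3
```

In a PD-colocated configuration, the same request would instead map its prefill and decode tasks onto shared NPU slots; the hierarchy itself is unchanged, which is what lets one scheduler handle both layouts.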
Problem

Research questions and friction points this paper is trying to address.

Flexible Service
Large Language Model
Resource Allocation
Innovation

Methods, ideas, or system contributions that make the work stand out.

cloud-native language model services
FlowServe engine with NPU
adaptive scheduling strategies
Junhao Hu
Huawei Cloud
Jiang Xu
Huawei Cloud
Yulong He
St Petersburg University
Yuetao Chen
Huawei Cloud
Gengyuan Dan
Huawei Cloud
Zhixia Liu
Huawei Cloud
Baoquan Zhang
Huawei Cloud
Shining Wan
Huawei Cloud
Zhiyu Dong
Huawei Cloud
Hao Xu
Huawei Cloud
Zhihao Ren
Huawei Cloud
Jiang Liu
Huawei Cloud
Jie Meng
Huawei Cloud
Chao He
Huawei Cloud
Tao Xie
Peking University
Dayun Lin
Huawei Cloud
Qin Zhang
Huawei Cloud
Yue Yu
Huawei Cloud
Hao Feng
Huawei Cloud
Xusheng Chen
Huawei Cloud
Distributed Systems · Cloud Computing · Distributed Databases
Yizhou Shan
Huawei Cloud
Disaggregation · Operating System · Distributed System · Computer Architecture