Cloud Native System for LLM Inference Serving

📅 2025-07-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address core challenges in cloud-based large language model (LLM) inference—including low resource utilization, high latency, and poor elasticity—this paper proposes a cloud-native inference service architecture. The architecture tightly integrates containerization, microservices, and dynamic resource scheduling, leveraging Kubernetes for fine-grained, automated scaling. It further introduces a request dispatching and resource matching algorithm specifically designed for LLM inference workload characteristics. Our key contribution lies in the co-optimization of service orchestration, scheduling policies, and LLM-specific computational traits, thereby overcoming limitations of conventional static deployment paradigms. Experimental evaluation demonstrates significant improvements over baseline approaches: end-to-end latency is reduced by 37.2%, throughput increases by 2.1×, and resource utilization improves by 58%. These gains collectively enhance service stability and cost-efficiency in production-grade LLM inference deployments.

Technology Category

Application Category

📝 Abstract
Large Language Models (LLMs) are revolutionizing numerous industries, but their substantial computational demands create challenges for efficient deployment, particularly in cloud environments. Traditional approaches to inference serving often struggle with resource inefficiencies, leading to high operational costs, latency issues, and limited scalability. This article explores how Cloud Native technologies, such as containerization, microservices, and dynamic scheduling, can fundamentally improve LLM inference serving. By leveraging these technologies, we demonstrate how a Cloud Native system enables more efficient resource allocation, reduces latency, and enhances throughput in high-demand scenarios. Through real-world evaluations using Kubernetes-based autoscaling, we show that Cloud Native architectures can dynamically adapt to workload fluctuations, mitigating performance bottlenecks while optimizing LLM inference serving performance. This discussion provides a broader perspective on how Cloud Native frameworks could reshape the future of scalable LLM inference serving, offering key insights for researchers, practitioners, and industry leaders in cloud computing and artificial intelligence.
Problem

Research questions and friction points this paper is trying to address.

Efficient deployment of LLMs in cloud environments
Resource inefficiencies in traditional inference serving
Dynamic workload adaptation for LLM inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Containerization and microservices for LLM serving
Dynamic scheduling to optimize resource allocation
Kubernetes-based autoscaling for workload adaptation
🔎 Similar Papers
No similar papers found.
Minxian Xu
Minxian Xu
Associate Professor, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Cloud ComputingMicroservicesLLM Inference
J
Junhan Liao
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, China
Jingfeng Wu
Jingfeng Wu
University of California, Berkeley
deep learning theorymachine learningoptimizationstatistical learning theory
Y
Yiyuan He
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, China
Kejiang Ye
Kejiang Ye
Professor, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Cloud ComputingAI SystemsIndustrial Internet
C
Chengzhong Xu
State Key Lab of IOTSC, University of Macau, China