Vortex: Hosting ML Inference and Knowledge Retrieval Services With Tight Latency and Throughput Requirements

📅 2025-11-03
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing ML serving platforms rely on batch processing to improve throughput, but suffer from unpredictable tail latencyโ€”failing to meet the low-latency and high-determinacy SLO requirements of AI agents and interactive end-user applications. This paper proposes an SLO-first architecture for inference and knowledge retrieval, overcoming the fundamental tail-latency limitations inherent in conventional batching. Our approach introduces three core innovations: (1) fine-grained pipeline scheduling driven by SLO constraints; (2) dynamic request priority management; and (3) RDMA-accelerated dataflow optimization. Experiments demonstrate that, under identical workloads, our system significantly reduces and stabilizes tail latency compared to TorchServe and Ray Serve. Under the same SLO constraints, it supports over twice the request rate; this advantage is further amplified in RDMA-enabled environments. The proposed system establishes a robust infrastructure that simultaneously delivers high throughput and low latency for interactive AI services.
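The paper does not publish Vortex's scheduler code, but the core contrast with batch-oriented serving can be illustrated by a minimal earliest-deadline-first queue: each request carries its own SLO budget, and the request closest to violating its deadline is dispatched next rather than waiting for a batch to fill. This is a hedged sketch only; the class and field names (`SLOQueue`, `Request`, `deadline`) are hypothetical, not Vortex's actual API.

```python
import heapq
import itertools
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    # Hypothetical request record: ordering is by (deadline, seq).
    deadline: float                      # arrival time + per-request SLO budget
    seq: int                             # tie-breaker for deterministic heap order
    payload: str = field(compare=False)  # excluded from ordering

class SLOQueue:
    """Earliest-deadline-first dispatch: the request nearest to an SLO
    violation is always served next, instead of being held until a
    fixed-size batch fills (the tail-latency hazard of batching)."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def submit(self, arrival: float, slo: float, payload: str) -> None:
        heapq.heappush(self._heap,
                       Request(arrival + slo, next(self._seq), payload))

    def next_request(self):
        return heapq.heappop(self._heap) if self._heap else None

# A tight-SLO interactive query jumps ahead of a loose-SLO agent request
# that arrived at the same time.
q = SLOQueue()
q.submit(arrival=0.0, slo=0.200, payload="interactive query")
q.submit(arrival=0.0, slo=1.000, payload="background agent")
first = q.next_request()  # the interactive query, deadline 0.200
```

A real SLO-first scheduler would combine this ordering with admission control and opportunistic micro-batching of requests whose deadlines still have slack; the sketch shows only the prioritization step.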


๐Ÿ“ Abstract
There is growing interest in deploying ML inference and knowledge retrieval as services that could support both interactive queries by end users and the more demanding request flows that arise when AIs are integrated into end-user applications and deployed as agents. Our central premise is that these latter cases will bring service-level latency objectives (SLOs). Existing ML serving platforms use batching to optimize for high throughput, exposing them to unpredictable tail latencies. Vortex enables an SLO-first approach. For identical tasks, Vortex's pipelines achieve significantly lower and more stable latencies than TorchServe and Ray Serve over a wide range of workloads, often enabling a given SLO target at more than twice the request rate. When RDMA is available, the Vortex advantage is even more significant.
Problem

Research questions and friction points this paper is trying to address.

Optimizing ML inference services for strict latency requirements
Addressing unpredictable tail latencies in existing serving platforms
Enabling an SLO-first approach for AI agent request flows
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vortex enables an SLO-first serving approach
Achieves lower and more stable latencies
Leverages RDMA for enhanced performance
Yuting Yang, Cornell University
Tiancheng Yuan, Cornell University
Jamal Hashim, Cornell University
Thiago Garrett, University of Oslo
Jeffrey Qian, Cornell University
Ann Zhang, Cornell University
Yifan Wang, Cornell University
Weijia Song, Research Associate, Cornell University (Distributed Systems, Cloud Computing)
Ken Birman, Cornell University