Circinus: Efficient Query Planner for Compound ML Serving

📅 2025-04-23
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This paper addresses real-time query planning for multi-operator AI pipelines in edge-cloud collaborative environments under concurrent service-level objective (SLO) constraints on latency, accuracy, and cost. Challenges include heterogeneous edge capabilities, diverse SLO requirements, and a high-dimensional optimization space. Method: the paper proposes the first decomposition-based planning framework supporting joint multi-query and multi-dimensional SLO optimization. It introduces an incremental pruning mechanism that leverages plan similarity, together with an accuracy-aware early-termination profiling method for rapid performance estimation. Results: experiments demonstrate a 3.2–5.0× improvement in service goodput, sub-second planning latency (a 4.2–5.8× speedup), and a 3.2–4.0× reduction in deployment cost, significantly outperforming single-tier baselines.
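The incremental pruning idea above can be illustrated with a minimal sketch: represent candidate plans as configuration tuples, and skip any candidate that sits within a small distance of a plan already shown to violate its SLO. All names here (`Plan` tuples, `prune_similar`, `distance`, the `radius` knob) are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of similarity-based incremental pruning, NOT the
# paper's implementation: a plan is skipped if it differs from an
# already-rejected plan on at most `radius` configuration knobs.

def distance(a, b):
    """Number of configuration knobs on which two plans differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def prune_similar(candidates, rejected, radius=1):
    """Keep only candidates farther than `radius` from every rejected plan."""
    return [p for p in candidates
            if all(distance(p, r) > radius for r in rejected)]

# Example: plans as (placement, batch_size, model_variant) tuples
candidates = [("edge", 4, "small"), ("edge", 8, "small"), ("cloud", 8, "large")]
rejected = [("edge", 4, "small")]  # already profiled and failed its SLO
survivors = prune_similar(candidates, rejected)
# ("edge", 8, "small") differs from the rejected plan in one knob, so it
# is pruned as well; only ("cloud", 8, "large") survives.
```

The point of the sketch is the search-step saving: each rejected plan eliminates a whole neighborhood of similar plans without profiling them.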

๐Ÿ“ Abstract
The rise of compound AI serving -- integrating multiple operators in a pipeline that may span edge and cloud tiers -- enables end-user applications such as autonomous driving, generative AI-powered meeting companions, and immersive gaming. Achieving high service goodput -- i.e., meeting service level objectives (SLOs) for pipeline latency, accuracy, and costs -- requires effective planning of operator placement, configuration, and resource allocation across infrastructure tiers. However, the diverse SLO requirements, varying edge capabilities, and high query volumes create an enormous planning search space, rendering current solutions fundamentally limited for real-time serving and cost-efficient deployments. This paper presents Circinus, an SLO-aware query planner for large-scale compound AI workloads. Circinus novelly decomposes multi-query planning and multi-dimensional SLO objectives while preserving global decision quality. By exploiting plan similarities within and across queries, it significantly reduces search steps. It further improves per-step efficiency with a precision-aware plan profiler that incrementally profiles and strategically applies early stopping based on imprecise estimates of plan performance. At scale, Circinus selects query-plan combinations to maximize global SLO goodput. Evaluations in real-world settings show that Circinus improves service goodput by 3.2-5.0×, accelerates query planning by 4.2-5.8×, achieving query response in seconds, while reducing deployment costs by 3.2-4.0× over the state of the art even in its intended single-tier deployments.
Problem

Research questions and friction points this paper is trying to address.

Optimizing compound AI serving pipelines across edge and cloud tiers
Planning operator placement and resource allocation for diverse SLOs
Reducing search space for real-time query planning efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes multi-query planning and SLO objectives
Reduces search steps via plan similarities
Uses precision-aware plan profiler for efficiency
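The precision-aware, early-stopping profiler described above can be sketched under simple assumptions: incrementally sample a plan's latency and stop as soon as a confidence bound shows the plan clearly meets or clearly misses the latency SLO. The function and parameter names (`profile_plan`, `slo_ms`, `max_samples`) are hypothetical, and a plain normal-approximation confidence interval stands in for whatever estimator the paper actually uses.

```python
import math
import random

# Hypothetical sketch of accuracy-aware early-termination profiling, NOT the
# paper's implementation: keep sampling until the confidence interval around
# the mean latency falls entirely below (meet) or above (miss) the SLO.

def profile_plan(sample_latency_ms, slo_ms, max_samples=200, z=1.96):
    """Return ('meet' | 'miss' | 'unsure', samples_used)."""
    latencies = []
    for n in range(1, max_samples + 1):
        latencies.append(sample_latency_ms())
        if n < 5:
            continue  # need a few samples before bounds are meaningful
        mean = sum(latencies) / n
        var = sum((x - mean) ** 2 for x in latencies) / (n - 1)
        half = z * math.sqrt(var / n)  # normal-approximation half-width
        if mean + half < slo_ms:
            return "meet", n   # upper bound already under the SLO
        if mean - half > slo_ms:
            return "miss", n   # lower bound already over the SLO
    return "unsure", max_samples

# Example: a fast plan whose latency is comfortably under a 100 ms SLO
random.seed(0)
verdict, used = profile_plan(lambda: random.gauss(40, 5), slo_ms=100)
```

For clear-cut plans the verdict arrives after a handful of samples, which is the source of the per-step speedup: profiling budget is spent only on plans whose SLO outcome is genuinely uncertain.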
Banruo Liu
University of Illinois Urbana-Champaign
Wei-Yu Lin
University of Illinois Urbana-Champaign
Minghao Fang
University of Illinois Urbana-Champaign
Yihan Jiang
Amazon AGI
LLM, LLM agent, Federated Learning
Fan Lai
University of Illinois Urbana-Champaign
Machine Learning Systems, Cloud Computing, Machine Learning