🤖 AI Summary
This work addresses the significant redundancy in agentic workloads that existing large language model (LLM) serving systems fail to exploit: by ignoring cross-invocation dependencies in agent workflows, these systems repeatedly process overlapping prompts and recompute intermediate results. It introduces classical query optimization principles into LLM agent serving by modeling workflows as query plans from a data systems perspective, treating LLM invocations as first-class operators. The authors propose a workflow-aware serving framework that integrates workflow-level cache reuse, cache-aware scheduling, shared KV state management, and query-plan-driven execution to enable end-to-end optimization. Experimental results demonstrate up to a 1.56× performance improvement over state-of-the-art baselines across diverse workloads, validating the effectiveness and novelty of the proposed approach.
📝 Abstract
Agentic workflows, composed of sequences of interdependent Large Language Model (LLM) calls, have become a dominant workload in modern AI systems. These workflows exhibit extensive redundancy from overlapping prompts and intermediate results due to speculative and parallel exploration. Existing LLM serving systems, such as vLLM, focus on optimizing individual inference calls and overlook cross-call dependencies, leading to significant inefficiencies. This paper rethinks LLM and agent serving from a data systems perspective and introduces Helium, a workflow-aware serving framework that models agentic workloads as query plans and treats LLM invocations as first-class operators. Helium integrates proactive caching and cache-aware scheduling to maximize reuse across prompts, KV states, and workflows. Through these techniques, Helium bridges classic query optimization principles with LLM serving, achieving up to a 1.56× speedup over state-of-the-art agent serving systems on various workloads. Our results demonstrate that end-to-end optimization across workflows is essential for scalable and efficient LLM-based agents.
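To make the core idea concrete, the following is a minimal sketch, not Helium's actual API, of the abstraction the abstract describes: an agentic workflow modeled as a plan of LLM-invocation operators, where sibling branches sharing a prompt prefix pay the prefix ("prefill") cost only once, and a cache-aware scheduler orders calls to maximize that reuse. All names (`LLMCall`, `PrefixCache`, `execute_plan`) are hypothetical, and the cache is a toy stand-in for real KV-state management.

```python
from dataclasses import dataclass

# Hypothetical sketch (not Helium's real interfaces): treat each LLM
# invocation as a first-class plan operator with an explicit shared prefix.

@dataclass(frozen=True)
class LLMCall:
    """One LLM invocation operator in the workflow's query plan."""
    prefix: str   # shared context, e.g. system prompt + task history
    suffix: str   # branch-specific continuation

class PrefixCache:
    """Toy stand-in for KV-state reuse: pay prefix prefill cost on misses only."""
    def __init__(self):
        self._cached = set()
        self.prefills = 0  # number of times the full prefix cost was paid

    def run(self, call: LLMCall) -> str:
        if call.prefix not in self._cached:
            self._cached.add(call.prefix)
            self.prefills += 1          # cache miss: compute prefix KV state
        # decode cost is always paid; a real system would return model output
        return f"<output for {call.suffix!r}>"

def execute_plan(calls, cache):
    """Cache-aware scheduling: group calls by prefix so reuse is maximized."""
    ordered = sorted(calls, key=lambda c: c.prefix)
    return [cache.run(c) for c in ordered]

# Three speculative/parallel exploration branches share one prompt prefix.
plan = [LLMCall("system+task", f"branch-{i}") for i in range(3)]
cache = PrefixCache()
outputs = execute_plan(plan, cache)
print(cache.prefills)  # prefix prefill paid once, not three times
```

A per-call serving system sees three independent requests and would prefill the shared prefix three times; a workflow-aware planner that knows the calls belong to one plan can schedule them together and reuse the prefix state, which is the kind of cross-call optimization the paper argues for.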