AGENTSERVESIM: A Hardware-aware Simulator for Multi-Turn LLM Agent Serving

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Evaluating scheduling, KV-cache management, and routing policies for multi-turn LLM agent serving on real systems is costly, and existing LLM serving simulators model only stateless, request-level workloads. To address this gap, this work presents AGENTSERVESIM, a hardware-aware simulator that models program-level context: multi-turn interactions, tool-invocation gaps, and cross-turn KV-cache locality. Its composable modular architecture combines a program orchestrator, a tool simulator, a session-aware router, and a KV residency model that tracks KV placement across HBM and host DRAM/CXL, as well as eviction. Across real serving deployments and hardware configurations, the simulator reproduces real-system behavior within 6% error on key performance metrics while running entirely on commodity CPUs, which makes exploring serving-policy designs cheaper and repeatable.

📝 Abstract

Multi-turn LLM agents interleave model calls with external tool invocations, shifting serving from stateless request processing to stateful program execution. Serving these workloads requires scheduling, KV-cache management, and routing policies that use program-level context, including turn dependencies, tool-induced gaps, and reusable KV state. Evaluating such policies directly on real systems is costly, since each design point may require dedicated accelerator time across arrival rates, model scales, serving-instance counts, and memory hierarchies. Simulation offers a scalable alternative, but existing LLM serving simulators target stateless request-level workloads and therefore omit the core dynamics of agent serving: multi-turn program execution, cross-turn cache locality, and KV-cache residency during tool gaps. We present AGENTSERVESIM, a hardware-aware simulator for multi-turn LLM agent serving. AGENTSERVESIM evaluates serving policies at program granularity through composable modules: a Program Orchestrator preserves program identity and turn order, a Tool Simulator materializes tool-induced gaps, a Session-Aware Router maintains program-to-instance affinity for cache-aware dispatch, and a KV Residency Model tracks policy-defined KV placement across HBM, host DRAM/CXL, and eviction. Across real serving deployments and hardware configurations, AGENTSERVESIM reproduces real-system behavior within 6% error across key performance metrics while running entirely on commodity CPUs. These results show that AGENTSERVESIM enables controlled, repeatable exploration of agent-serving policies without requiring exhaustive deployment on costly accelerators.
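To make the program-level view concrete, the following toy sketch shows how turn order, tool gaps, session affinity, and KV residency can interact in a discrete-event loop. It is an illustration only and not AGENTSERVESIM's API: every name (`Turn`, `Program`, `Instance`, `SessionAwareRouter`, `simulate`, `run`) is hypothetical, and the cost model is an assumption (a single HBM tier with LRU eviction, round-robin first placement, prefill time linear in uncached tokens, decode time ignored).

```python
import heapq
from collections import OrderedDict
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    new_tokens: int    # tokens appended to the context in this turn
    tool_gap_s: float  # simulated tool latency after this turn's model call


@dataclass
class Program:
    pid: str
    turns: List[Turn]


class Instance:
    """One serving instance with a toy LRU KV residency model (assumed: HBM only)."""

    def __init__(self, name, hbm_capacity_tokens):
        self.name = name
        self.capacity = hbm_capacity_tokens
        self.resident = OrderedDict()  # pid -> cached context tokens, oldest first

    def cached_tokens(self, pid):
        return self.resident.get(pid, 0)

    def store(self, pid, tokens):
        self.resident.pop(pid, None)
        self.resident[pid] = tokens  # re-insert as most recently used
        # Evict least recently used programs until the cache fits again.
        while sum(self.resident.values()) > self.capacity and len(self.resident) > 1:
            self.resident.popitem(last=False)


class SessionAwareRouter:
    """Pins each program to one instance so later turns can reuse its KV state."""

    def __init__(self, instances):
        self.instances = instances
        self.affinity = {}
        self._next = 0

    def route(self, pid):
        if pid not in self.affinity:  # first turn: round-robin placement (assumed)
            self.affinity[pid] = self.instances[self._next % len(self.instances)]
            self._next += 1
        return self.affinity[pid]


def simulate(programs, router, prefill_s_per_token=0.001):
    """Return ({pid: finish time}, total tokens that had to be prefilled)."""
    # Event: (ready time, tie-breaker, program, turn index, context so far).
    events = [(0.0, i, p, 0, 0) for i, p in enumerate(programs)]
    heapq.heapify(events)
    busy_until = {inst.name: 0.0 for inst in router.instances}
    finish, prefilled = {}, 0
    while events:
        ready, seq, prog, idx, context = heapq.heappop(events)
        inst = router.route(prog.pid)
        turn = prog.turns[idx]
        context += turn.new_tokens
        missing = context - inst.cached_tokens(prog.pid)  # uncached part of context
        start = max(ready, busy_until[inst.name])         # wait if instance is busy
        end = start + missing * prefill_s_per_token
        busy_until[inst.name] = end
        inst.store(prog.pid, context)
        prefilled += missing
        if idx + 1 < len(prog.turns):
            # The next turn only becomes ready after the tool gap has elapsed.
            heapq.heappush(events, (end + turn.tool_gap_s, seq, prog, idx + 1, context))
        else:
            finish[prog.pid] = end
    return finish, prefilled


def run(capacity_tokens):
    programs = [Program(pid, [Turn(100, 1.0), Turn(50, 0.0)]) for pid in ("A", "B")]
    router = SessionAwareRouter([Instance("gpu0", capacity_tokens)])
    return simulate(programs, router)


_, warm = run(1000)  # both contexts stay resident across the tool gap
_, cold = run(100)   # the second program evicts the first during its tool gap
print(warm, cold)    # 300 500
```

In this toy setup, shrinking the cache so that one program's KV state is evicted during another's tool gap raises the prefilled token count from 300 to 500, which is the kind of cross-turn effect a request-level simulator would not capture.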

Problem

Research questions and friction points this paper is trying to address.

multi-turn LLM agents

KV-cache management

hardware-aware simulation

stateful serving

tool-induced gaps

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-turn LLM agents

hardware-aware simulation

KV-cache management