🤖 AI Summary
This work addresses a limitation of existing large language model (LLM) multi-agent orchestration approaches: they ignore the runtime state of the serving infrastructure, which leads to imbalanced resource utilization and compounding latency on shared GPU clusters. To overcome this, we propose INFRAMIND, an infrastructure-aware multi-agent orchestration framework that feeds system-level signals (queue depth, KV-cache pressure, and response latency) into a three-tier decision pipeline covering planning, per-step routing, and scheduling. The problem is cast as a hierarchical constrained Markov decision process, and the infrastructure-aware planner, infrastructure-aware executor, and budget-aware scheduler are trained end-to-end via reinforcement learning. Across five benchmarks, INFRAMIND improves accuracy by up to 7.6 percentage points over the prior baseline at low load with up to 7× lower latency; under high load, it sustains up to 99.9% service-level objective (SLO) compliance where every baseline drops below 50%.
📝 Abstract
Existing multi-agent LLM orchestration methods, ranging from brute-force ensembles to learned routers, select models and topologies based on task and model features. However, these methods do not consider the runtime state of the serving infrastructure. On shared GPU clusters under concurrent load, this infrastructure blindness causes systematic resource underutilization: preferred models accumulate deep request queues while equally capable alternatives sit idle. In multi-agent pipelines, where each query triggers multiple sequential model calls, these delays then compound across every downstream step. Closing this gap is challenging because the relevant infrastructure signals (queue depths, KV-cache pressure, latencies) are dynamic and noisy, and they must drive three different decisions: planning, per-step routing, and scheduling. We introduce INFRAMIND, a framework that makes the entire multi-agent stack infrastructure-aware. An infra-aware planner conditions topology and role selection on real-time system load and remaining budget, biasing toward simpler graphs under congestion and richer ones at low load. An infra-aware executor then observes per-model queue depths, cache utilization, and response latencies at each agent step to decide which model to call and how deeply to reason; a budget-aware scheduler further reorders each model's queue so that urgent requests are served first. Cast as a hierarchical constrained MDP and solved end-to-end via reinforcement learning, the system learns to balance quality against latency automatically. Across five benchmarks, INFRAMIND delivers up to +7.6 pp accuracy over the prior baseline at low load with up to 7x lower latency, and sustains up to 99.9% SLO compliance under high load where every baseline drops below 50%.
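To make the routing and scheduling ideas above concrete, here is a minimal Python sketch. It is not INFRAMIND's learned policy: the abstract describes decisions trained with reinforcement learning, whereas this sketch substitutes a hand-written heuristic. All names (`ModelState`, `estimated_wait`, `pick_model`, `SlackQueue`) and the `cache_penalty` weight are hypothetical and chosen only for illustration.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass
class ModelState:
    """Hypothetical snapshot of one served model's runtime signals."""
    name: str
    queue_depth: int        # requests currently waiting on this model
    kv_cache_util: float    # fraction of KV cache in use, 0.0 to 1.0
    mean_latency_s: float   # recent mean per-request latency in seconds


def estimated_wait(m: ModelState, cache_penalty: float = 2.0) -> float:
    """Crude wait estimate (an assumption, not the paper's formula):
    queued requests plus this one, inflated by KV-cache pressure."""
    return (m.queue_depth + 1) * m.mean_latency_s * (1.0 + cache_penalty * m.kv_cache_util)


def pick_model(candidates: list[ModelState]) -> ModelState:
    """Route to the candidate with the smallest estimated wait."""
    return min(candidates, key=estimated_wait)


class SlackQueue:
    """Per-model queue that serves the request with the least remaining budget first."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, int, str]] = []
        self._counter = itertools.count()  # tie-breaker: FIFO among equal budgets

    def push(self, request_id: str, remaining_budget_s: float) -> None:
        heapq.heappush(self._heap, (remaining_budget_s, next(self._counter), request_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


# Usage: a congested preferred model loses to an idle alternative.
models = [
    ModelState("large", queue_depth=12, kv_cache_util=0.9, mean_latency_s=1.5),
    ModelState("medium", queue_depth=1, kv_cache_util=0.2, mean_latency_s=1.0),
]
chosen = pick_model(models)

# Usage: the request closest to exhausting its budget is served first.
queue = SlackQueue()
queue.push("a", remaining_budget_s=5.0)
queue.push("b", remaining_budget_s=0.8)
queue.push("c", remaining_budget_s=5.0)
order = [queue.pop() for _ in range(3)]
```

With these made-up numbers, the deep queue and high cache utilization on `large` make `medium` the cheaper choice, which mirrors the "preferred models accumulate deep request queues while equally capable alternatives sit idle" failure mode the abstract describes. A learned executor would replace the fixed `estimated_wait` formula with a policy that also weighs answer quality against latency.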