🤖 AI Summary
Jointly controlling token cost and end-to-end latency remains challenging in budget-constrained large language model (LLM)-based multi-agent systems (MAS). Method: This paper proposes AgentBalance, a backbone-then-topology framework that constructs MAS under explicit token-cost and latency budgets. It first builds agents with heterogeneous LLM backbones through pool construction, pool selection, and role-backbone matching, then generates the communication topology through agent representation learning, gating, and latency-aware topology synthesis. Results: In experiments with 14 candidate LLM backbones, the method achieves up to 10% higher task performance under matched token-cost budgets and up to 22% higher under matched latency budgets, with strong AUC on performance-versus-budget curves. It also works as a plug-in for existing MAS and generalizes to unseen LLMs.
📝 Abstract
Large Language Model (LLM)-based multi-agent systems (MAS) are becoming indispensable building blocks for web-scale applications such as web search, social network analytics, and online customer support, where cost-effectiveness is increasingly the primary constraint for large-scale deployment. While recent work improves MAS cost-effectiveness by shaping inter-agent communication topologies and selecting agent backbones, it rarely models and optimizes under explicit token-cost and latency budgets that reflect deployment constraints. This often leads to topology-first designs and suboptimal cost-effectiveness when budgets are binding. We present AgentBalance, a framework for constructing cost-effective MAS under explicit token-cost and latency budgets via a backbone-then-topology design. AgentBalance first performs backbone-oriented agent generation, constructing agents with heterogeneous backbones through LLM pool construction, pool selection, and role-backbone matching. It then performs adaptive MAS topology generation, guiding inter-agent communication via agent representation learning, gating, and latency-aware topology synthesis. Experiments on benchmarks with 14 candidate LLM backbones show that AgentBalance achieves up to 10% and 22% performance gains under matched token-cost and latency budgets, respectively, and yields strong AUC on performance-versus-budget curves across benchmarks. AgentBalance also functions as a plug-in for existing MAS, improving performance under the same token-cost and latency constraints, and it generalizes well to unseen LLMs for practical, budget-aware deployment. Code: https://github.com/usail-hkust/AgentBalance
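The AUC figure mentioned above summarizes a whole performance-versus-budget curve in one number. A minimal sketch of how such a value could be computed follows, assuming trapezoidal integration and normalization by the budget span; the abstract does not specify either choice, and the function name `normalized_auc` and the sample points are hypothetical, not taken from the paper.

```python
from typing import Sequence


def normalized_auc(budgets: Sequence[float], scores: Sequence[float]) -> float:
    """Area under a performance-versus-budget curve, divided by the budget span.

    Assumption (not from the paper): trapezoidal rule, normalized so the result
    equals the average score over the covered budget range.
    """
    if len(budgets) != len(scores) or len(budgets) < 2:
        raise ValueError("need at least two (budget, score) points of equal length")
    # Sort by budget so the points can be given in any order.
    points = sorted(zip(budgets, scores))
    span = points[-1][0] - points[0][0]
    if span <= 0:
        raise ValueError("budgets must cover a non-zero range")
    area = 0.0
    for (b0, s0), (b1, s1) in zip(points, points[1:]):
        # Trapezoid between two neighbouring budget levels.
        area += (b1 - b0) * (s0 + s1) / 2.0
    return area / span


if __name__ == "__main__":
    # Placeholder numbers for illustration only; they are not results from the paper.
    token_budgets = [1000, 2000, 4000, 8000]
    accuracy = [0.52, 0.61, 0.66, 0.68]
    print(f"normalized AUC: {normalized_auc(token_budgets, accuracy):.3f}")
```

Under this normalization a method whose curve sits higher across the whole budget range scores higher, even if two methods tie at the largest budget. The same function applies unchanged to latency budgets.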