🤖 AI Summary
This work addresses the limited capacity of autonomous agents to perform structured reasoning and collaborative optimization in large, stateful action spaces. It introduces Arbor, a multi-agent framework that uses structured tree search as its cognition layer: a scored hypothesis tree serves as shared working memory across agents, so that domain-specialist agents explore collaboratively and refine strategies as failures are diagnosed. Arbor pairs an Orchestrator with a Critic in a checks-and-balances architecture and decomposes agent capabilities into hard skills (domain expertise) and soft skills (coordination protocols), which supports fully autonomous multi-day campaigns. Evaluated on full-stack LLM inference optimization, Arbor achieves up to a 193% Pareto improvement in the throughput–latency trade-off over vendor-optimized baselines, whereas a single agent without the harness plateaus at a +33% throughput improvement and crashes irrecoverably within hours. Arbor also generalizes across multiple hardware generations, with run-to-run variance within 2 percentage points.
📝 Abstract
Arbor is a multi-agent framework that introduces structured tree search as a cognition layer for autonomous agents operating in large, stateful action spaces. Prior autonomous optimization systems operate on isolated targets with stateless evaluation. Arbor instead maintains an explicit search tree of scored hypotheses that serves as the shared working memory across agents, evolving with every measurement, treating failures as diagnostic signal that reshapes subsequent exploration, and expanding as prior successes shift the bottleneck distribution.
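To make the idea of a scored hypothesis tree as shared working memory concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Arbor's implementation: the `Hypothesis` class, its fields, the helpers `open_leaves` and `best_scored`, and every description and score in the example are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Hypothesis:
    """One node of the shared tree: a candidate optimization and its outcome.

    All names and fields here are hypothetical, chosen for illustration only.
    """
    description: str
    score: Optional[float] = None        # measured gain; None until evaluated
    failure_note: Optional[str] = None   # diagnosis recorded when a trial fails
    children: List["Hypothesis"] = field(default_factory=list)

    def add_child(self, description: str) -> "Hypothesis":
        """Attach a follow-up hypothesis derived from this node."""
        child = Hypothesis(description)
        self.children.append(child)
        return child

    def record(self, score: Optional[float], failure_note: Optional[str] = None) -> None:
        """Store a measurement, or a diagnosis if the trial failed."""
        self.score = score
        self.failure_note = failure_note


def open_leaves(node: Hypothesis) -> Iterator[Hypothesis]:
    """Yield leaves that have been neither measured nor diagnosed (depth-first)."""
    if not node.children and node.score is None and node.failure_note is None:
        yield node
    for child in node.children:
        yield from open_leaves(child)


def best_scored(node: Hypothesis) -> Optional[Hypothesis]:
    """Return the evaluated node with the highest score, or None if there is none."""
    best = node if node.score is not None else None
    for child in node.children:
        candidate = best_scored(child)
        if candidate is not None and (best is None or candidate.score > best.score):
            best = candidate
    return best


# Illustrative campaign; descriptions and scores are made up.
root = Hypothesis("baseline configuration")
root.record(0.0)

batching = root.add_child("increase batch size")
batching.record(0.12)  # a success, which may shift the bottleneck elsewhere

kernel = root.add_child("custom attention kernel")
kernel.record(None, failure_note="out of memory at long context")

# The failure is kept as a diagnostic signal and spawns a refined hypothesis.
kernel.add_child("retry kernel with smaller tile size")
# The success expands the tree toward the newly exposed bottleneck.
batching.add_child("tune scheduler after batching")

frontier = [n.description for n in open_leaves(root)]
winner = best_scored(root)
```

In this sketch the failed kernel trial stays in the tree with its diagnosis and gains a child, so any agent reading the tree sees both what failed and what to try next.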
We validate Arbor on full-stack LLM inference optimization, a domain where achieving peak performance has historically required coordinated effort from engineering teams across the application, framework, compiler, kernel, and hardware stack. Arbor pairs an Orchestrator agent, which drives optimization by delegating to Domain Specialists across the inference stack, with a Critic agent that safeguards stability through root-cause analysis, introspection, and measurement validation, forming a checks-and-balances architecture in which neither agent can unilaterally drive the system. Agent capabilities are decomposed into hard skills (domain expertise) and soft skills (coordination protocols that determine how contributions compose), enabling fully autonomous multi-day campaigns. Arbor achieves up to a 193% improvement in the inference throughput-latency Pareto frontier over vendor-optimized baselines, while a single agent without the harness plateaus at a +33% throughput improvement and crashes irrecoverably within hours. Arbor generalizes across multiple generations of hardware platforms, and run-to-run variance is within 2 percentage points, demonstrating that the method is hardware-agnostic and reproducible.
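For readers unfamiliar with throughput-latency Pareto comparisons, the sketch below shows the standard notion of Pareto dominance for two objectives, where higher throughput and lower latency are both better. It is a generic illustration, not the paper's metric: the abstract does not say how the 193% figure is computed, the function names `dominates` and `pareto_front` are hypothetical, and the data points are invented.

```python
from typing import List, Tuple

# A configuration is summarized as (throughput, latency); units are arbitrary here.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one.

    Higher throughput is better; lower latency is better.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates (input order preserved)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Invented measurements, for illustration only.
configs: List[Point] = [
    (100.0, 50.0),
    (150.0, 60.0),
    (120.0, 70.0),  # dominated by (150.0, 60.0): lower throughput and higher latency
    (90.0, 40.0),
]

front = pareto_front(configs)
```

Comparing whole frontiers rather than a single operating point matters because an optimization that raises throughput at the cost of latency moves along the trade-off curve instead of improving it.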