🤖 AI Summary
This work addresses the fundamental trade-off between performance and efficiency in large language model (LLM) inference. We propose Avengers-Pro, a test-time routing framework that embeds and clusters incoming queries to orchestrate heterogeneous LLMs of varying capacity and computational efficiency, selecting a model for each query according to a tunable accuracy-cost trade-off. Its core contribution is the first unified routing paradigm enabling arbitrary accuracy-efficiency trade-offs, and the first demonstration of multi-model ensembles strictly dominating the best single model in both accuracy and cost. Evaluated across six benchmarks, Avengers-Pro exceeds the average accuracy of the strongest single model (GPT-5-medium) by 7%, or matches that accuracy at 27% lower inference cost; at 63% lower cost, it still retains about 90% of the strongest single model's performance.
📝 Abstract
Balancing performance and efficiency is a central challenge in large language model (LLM) advancement. GPT-5 addresses this with test-time routing, dynamically assigning queries to either an efficient or a high-capacity model during inference. In this work, we present Avengers-Pro, a test-time routing framework that ensembles LLMs of varying capacities and efficiencies, providing a unified solution for all performance-efficiency trade-offs. Avengers-Pro embeds and clusters incoming queries, then routes each query to the most suitable model based on a performance-efficiency score. Across 6 challenging benchmarks and 8 leading models, including GPT-5-medium, Gemini-2.5-pro, and Claude-opus-4.1, Avengers-Pro achieves state-of-the-art results: by varying a performance-efficiency trade-off parameter, it can surpass the strongest single model (GPT-5-medium) by +7% in average accuracy. Moreover, it can match the average accuracy of the strongest single model at 27% lower cost, and reach ~90% of that performance at 63% lower cost. Finally, it achieves a Pareto frontier: compared with every single model, it consistently yields the highest accuracy for any given cost and the lowest cost for any given accuracy. Code is available at https://github.com/ZhangYiqun018/AvengersPro.
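To make the embed, cluster, and route pipeline concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the released implementation: the abstract does not give the scoring formula, so the weighted sum `alpha * accuracy + (1 - alpha) * (1 - cost)` over min-max normalized values is assumed, as are the single nearest-centroid lookup, the function names, the model names, and all numbers.

```python
import math

def nearest_cluster(embedding, centroids):
    """Index of the centroid closest to the query embedding (Euclidean)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(embedding, centroids[i]))

def normalize(values):
    """Min-max normalize a {model: value} dict to the range [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    return {m: (v - lo) / span if span else 0.0 for m, v in values.items()}

def route(embedding, centroids, accuracy, cost, alpha):
    """Pick a model for one query.

    accuracy[c][m] and cost[c][m] are per-cluster statistics for model m,
    assumed to have been measured on held-out queries in cluster c.
    alpha in [0, 1] is the trade-off knob: 1.0 favors accuracy only,
    0.0 favors low cost only. The scoring rule is an assumption.
    """
    c = nearest_cluster(embedding, centroids)
    acc = normalize(accuracy[c])
    cst = normalize(cost[c])
    scores = {m: alpha * acc[m] + (1 - alpha) * (1 - cst[m]) for m in acc}
    return max(scores, key=scores.get)

# Toy data (hypothetical): two clusters in a 2-D embedding space.
centroids = [(0.0, 0.0), (1.0, 1.0)]
accuracy = [
    {"small": 0.80, "medium": 0.82, "large": 0.84},  # cluster 0: easy queries
    {"small": 0.30, "medium": 0.70, "large": 0.90},  # cluster 1: hard queries
]
cost = [
    {"small": 1.0, "medium": 3.0, "large": 10.0},
    {"small": 1.0, "medium": 3.0, "large": 10.0},
]

hard_query = (0.9, 1.1)  # lands in cluster 1
for alpha in (0.0, 0.5, 1.0):
    print(alpha, route(hard_query, centroids, accuracy, cost, alpha))
```

Sweeping `alpha` from 0 to 1 moves the router from the cheapest model toward the most accurate one within each cluster, which is how a single parameter can trace out different points on an accuracy-cost curve.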