🤖 AI Summary
This work addresses the performance–cost trade-off in large language model (LLM) query routing by proposing CARROT, described as the first lightweight, cost-aware router with provable minimax rate optimality. Methodologically, it introduces a routing policy that selects a model per query from joint estimates of performance and cost, and it constructs SPROUT, a comprehensive benchmark comprising diverse real-world queries and state-of-the-art LLMs. The theoretical analysis establishes minimax rate optimality within a statistical learning framework. The empirical evaluation uses SPROUT, RouterBench, and Open LLM Leaderboard v2. Results indicate that the router reduces average invocation cost by 23%–41% while preserving response quality, with routing latency under 5 ms, outperforming the alternative routers it is compared against.
📝 Abstract
With the rapid growth in the number of Large Language Models (LLMs), there has been recent interest in LLM routing, that is, directing each query to the cheapest LLM that can deliver a suitable response. Following this line of work, we introduce CARROT, a Cost AwaRe Rate Optimal rouTer that can select models based on any desired trade-off between performance and cost. Given a query, CARROT selects a model based on estimates of each model's cost and performance. Its simplicity makes CARROT computationally efficient, while our theoretical analysis demonstrates that its routing performance is minimax rate-optimal. Alongside CARROT, we also introduce the Smart Price-aware Routing (SPROUT) dataset to facilitate routing on a wide spectrum of queries with the latest state-of-the-art LLMs. Using SPROUT and prior benchmarks such as RouterBench and Open LLM Leaderboard v2, we empirically validate CARROT's performance against several alternative routers.
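The selection step described in the abstract (estimate each model's performance and cost for a query, then pick according to a chosen trade-off) can be illustrated with a minimal sketch. The scoring rule below, a weighted difference `mu * performance - (1 - mu) * cost`, is an assumption made for illustration, as are the names `ModelEstimators`, `route`, `mu`, and the toy numbers; the abstract does not specify the exact objective or how the estimators are trained.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ModelEstimators:
    """Hypothetical per-model predictors (illustrative, not the paper's API)."""
    predict_performance: Callable[[str], float]  # estimated quality, assumed in [0, 1]
    predict_cost: Callable[[str], float]         # estimated cost, assumed on a comparable scale


def route(query: str, models: Dict[str, ModelEstimators], mu: float) -> str:
    """Return the name of the model maximizing mu * performance - (1 - mu) * cost.

    mu = 1.0 ignores cost entirely; mu = 0.0 ignores performance entirely.
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    if not models:
        raise ValueError("at least one candidate model is required")

    def score(name: str) -> float:
        est = models[name]
        return mu * est.predict_performance(query) - (1.0 - mu) * est.predict_cost(query)

    return max(models, key=score)


# Toy estimators with constant predictions (made-up numbers, for illustration only).
toy_models = {
    "small": ModelEstimators(lambda q: 0.6, lambda q: 0.1),
    "large": ModelEstimators(lambda q: 0.9, lambda q: 0.8),
}

if __name__ == "__main__":
    for mu in (0.0, 0.5, 0.8, 1.0):
        print(mu, route("What is 2 + 2?", toy_models, mu))
```

In this toy setup, a low `mu` favors the cheaper model and a high `mu` favors the stronger one. In practice the two estimates would need to be placed on comparable scales before they are combined, and the predictors would be learned from data rather than fixed constants.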