🤖 AI Summary
This work addresses how to select a large language model (LLM) for each incoming request in multi-model deployments so as to balance performance against inference cost. The authors propose OrcaRouter, a routing method built on the LinUCB contextual bandit framework that combines offline learning from full-information feedback with online updates from bandit feedback. OrcaRouter uses lexical and sentence-embedding features to make context-aware routing decisions and can continue learning after deployment. On the RouterArena benchmark, OrcaRouter-Adaptive achieved an arena score of 72.08 (second on the public leaderboard at submission time, May 20, 2026) and 75.54% accuracy at an inference cost of USD 1.00 per 1,000 queries.
📝 Abstract
The rapid development of large language models, each with distinct capabilities and inference costs, raises a practical deployment question: given an incoming request, which model should handle it? We present OrcaRouter, a production-oriented LLM router that combines a LinUCB-based contextual bandit over lexical and sentence-embedding features with a hybrid offline-online learning protocol. Offline, OrcaRouter obtains full-information feedback by evaluating each candidate model on a curated set of routing prompts, yielding a reward matrix used to fit one ridge regressor per arm. At deployment time, it initializes from these parameters and can optionally continue learning from bandit feedback, updating only the selected model's arm after observing its reward. At the time of our RouterArena submission (May 20, 2026), OrcaRouter-Adaptive ranked second on the public RouterArena leaderboard with an arena score of 72.08, achieving 75.54% accuracy at a cost of USD 1.00 per 1,000 queries.
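The hybrid protocol described in the abstract can be illustrated with a minimal sketch. It assumes a disjoint LinUCB formulation (one ridge regressor per arm) and a scalar reward that already combines answer quality and cost. The class name `LinUCBRouter`, the parameters `alpha` and `ridge`, and the method names are hypothetical choices for illustration, not OrcaRouter's actual implementation, and feature extraction (lexical plus sentence-embedding features) is left out.

```python
import numpy as np


class LinUCBRouter:
    """Illustrative disjoint LinUCB router (hypothetical, not the authors' code)."""

    def __init__(self, n_arms, dim, alpha=1.0, ridge=1.0):
        self.alpha = alpha  # exploration weight (assumed hyperparameter)
        # Per-arm sufficient statistics: A = ridge * I + sum(x x^T), b = sum(r * x)
        self.A = np.stack([ridge * np.eye(dim) for _ in range(n_arms)])
        self.b = np.zeros((n_arms, dim))

    def fit_offline(self, X, R):
        """Offline phase with full-information feedback.

        X: (n_prompts, dim) feature matrix.
        R: (n_prompts, n_arms) reward matrix, one column per candidate model.
        Every arm sees every prompt, so all arms share the same Gram matrix.
        """
        gram = X.T @ X
        for arm in range(self.A.shape[0]):
            self.A[arm] += gram
            self.b[arm] += X.T @ R[:, arm]

    def select(self, x):
        """Return the arm with the highest upper confidence bound for features x."""
        scores = []
        for A_arm, b_arm in zip(self.A, self.b):
            theta = np.linalg.solve(A_arm, b_arm)  # ridge regression estimate
            bonus = self.alpha * np.sqrt(x @ np.linalg.solve(A_arm, x))
            scores.append(x @ theta + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        """Online phase with bandit feedback: only the selected arm is updated."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

In this sketch, `fit_offline` plays the role of initialization from the reward matrix, while `select` followed by `update` corresponds to optional continued learning after deployment. Setting `alpha` to 0 turns off the exploration bonus and reduces routing to a greedy choice over the ridge predictions.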