🤖 AI Summary
To address the cost–quality trade-off in large-scale commercial LLM deployments, this paper proposes an intelligent prompt routing framework. The method couples a frozen encoder with lightweight per-model adapters to build a scalable quality prediction module, trained on 1.5 million annotated samples, and supports a user-defined quality tolerance τ. It integrates multi-model response evaluation with a dynamic routing algorithm, validated at industrial scale on the IPRBench benchmark. Deployed on a major cloud platform, the system achieves a 43.9% reduction in inference cost while maintaining quality comparable to the strongest Claude model, with per-request latency under 150 ms. Key contributions: (1) a modular architecture decoupling quality estimation from LLM execution; (2) a user-controllable, fine-grained cost–quality trade-off mechanism; and (3) an efficient, production-ready quality prediction paradigm enabling real-time, adaptive routing.
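The frozen-encoder-plus-adapter design can be sketched as follows. This is a minimal illustration, not the paper's implementation: the fixed hash-seeded random projection stands in for a real pretrained text encoder, and the `QualityAdapter` class and model names are hypothetical. The point it demonstrates is architectural: the shared encoder never changes, so integrating a new LLM only requires training one small adapter head.

```python
import hashlib
import numpy as np

ENC_DIM = 64
rng = np.random.default_rng(0)

def frozen_encode(prompt: str) -> np.ndarray:
    """Deterministic stand-in for a frozen prompt encoder (weights never updated)."""
    seed = int(hashlib.sha256(prompt.encode()).hexdigest()[:8], 16)
    return np.random.default_rng(seed).standard_normal(ENC_DIM)

class QualityAdapter:
    """Lightweight per-model head on top of the shared frozen encoder.

    Only this head would be trained when a new LLM candidate is added;
    the encoder stays fixed, which is what makes integration fast.
    """
    def __init__(self, dim: int = ENC_DIM):
        self.w = rng.standard_normal(dim) * 0.01  # small random init
        self.b = 0.0

    def predict(self, emb: np.ndarray) -> float:
        # Sigmoid squashes the score into [0, 1], matching a calibrated
        # quality scale.
        z = float(emb @ self.w + self.b)
        return 1.0 / (1.0 + np.exp(-z))

# One adapter per candidate model; names are illustrative only.
adapters = {"model-small": QualityAdapter(), "model-large": QualityAdapter()}
emb = frozen_encode("Summarize this contract in two sentences.")
scores = {name: a.predict(emb) for name, a in adapters.items()}
```

In a real deployment the encoder pass is shared across all candidates, so per-model cost is just one cheap head evaluation, which helps explain the sub-150 ms latency budget.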
📝 Abstract
Routing incoming queries to the most cost-effective LLM while maintaining response quality poses a fundamental challenge in optimizing performance–cost trade-offs for large-scale commercial systems. We present IPR, a quality-constrained Intelligent Prompt Routing framework that dynamically selects optimal models based on predicted response quality and user-specified tolerance levels. IPR introduces three key innovations: (1) a modular architecture with lightweight quality estimators trained on 1.5M prompts annotated with calibrated quality scores, enabling fine-grained quality prediction across model families; (2) a user-controlled routing mechanism with a tolerance parameter $\tau \in [0,1]$ that provides explicit control over quality–cost trade-offs; and (3) an extensible design using frozen encoders with model-specific adapters, reducing new-model integration from days to hours. To rigorously train and evaluate IPR, we curate an industrial-scale dataset, IPRBench (to be released upon legal approval), a comprehensive benchmark containing 1.5 million examples with response-quality annotations across 11 LLM candidates. Deployed on a major cloud platform, IPR achieves a 43.9% cost reduction while maintaining quality parity with the strongest model in the Claude family, and processes requests with sub-150 ms latency.
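A tolerance-controlled routing rule of the kind described can be sketched as below. This is an assumed reading, not the paper's exact algorithm: route to the cheapest model whose predicted quality is within $\tau$ of the best predicted quality for that prompt, so $\tau = 0$ always picks the top predicted model and larger $\tau$ trades quality headroom for cost. The model names, quality scores, and relative costs are illustrative placeholders.

```python
def route(pred_quality: dict[str, float], cost: dict[str, float], tau: float) -> str:
    """Pick the cheapest model whose predicted quality is within tau of the best.

    pred_quality: per-model predicted quality in [0, 1] for this prompt
    cost:         per-model relative price
    tau:          user-specified quality tolerance in [0, 1]
    """
    best = max(pred_quality.values())
    # Models whose predicted quality loss versus the best is at most tau.
    eligible = [m for m, q in pred_quality.items() if q >= best - tau]
    return min(eligible, key=lambda m: cost[m])

# Illustrative numbers only (not from the paper).
preds = {"haiku": 0.78, "sonnet": 0.86, "opus": 0.90}
costs = {"haiku": 0.25, "sonnet": 1.00, "opus": 5.00}

route(preds, costs, tau=0.0)   # strictest setting -> "opus"
route(preds, costs, tau=0.15)  # tolerant setting  -> "haiku"
```

Under this rule, the reported 43.9% cost reduction at quality parity corresponds to a small $\tau$: many prompts are predicted to be answered nearly as well by a cheaper model, so only the hard prompts reach the most expensive one.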