🤖 AI Summary
To address the cost–quality trade-off in large-scale commercial LLM deployments, this paper proposes an intelligent prompt routing framework. The method couples a frozen encoder with lightweight per-model adapters to build a scalable quality prediction module, trained on 1.5 million annotated samples, and supports a user-defined quality tolerance τ. It integrates multi-model response evaluation with a dynamic routing algorithm, validated at industrial scale on the IPRBench benchmark. Deployed on a major cloud platform, the system achieves a 43.9% reduction in inference cost while maintaining quality comparable to the strongest Claude model, with per-request latency under 150 ms. Key contributions: (1) a modular architecture decoupling quality estimation from LLM execution; (2) a user-controllable, fine-grained cost–quality trade-off mechanism; and (3) an efficient, production-ready quality prediction paradigm enabling real-time, adaptive routing.
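The frozen-encoder-plus-adapter design can be sketched as follows. This is a minimal illustration, not the paper's implementation: the fixed hash-seeded random projection stands in for a real pretrained text encoder, and the `QualityAdapter` class and model names are hypothetical. The point it demonstrates is architectural: the shared encoder never changes, so integrating a new LLM only requires training one small adapter head.

```python
import hashlib
import numpy as np

ENC_DIM = 64
rng = np.random.default_rng(0)

def frozen_encode(prompt: str) -> np.ndarray:
    """Deterministic stand-in for a frozen prompt encoder (weights never updated)."""
    seed = int(hashlib.sha256(prompt.encode()).hexdigest()[:8], 16)
    return np.random.default_rng(seed).standard_normal(ENC_DIM)

class QualityAdapter:
    """Lightweight per-model head on top of the shared frozen encoder.

    Only this head would be trained when a new LLM candidate is added;
    the encoder stays fixed, which is what makes integration fast.
    """
    def __init__(self, dim: int = ENC_DIM):
        self.w = rng.standard_normal(dim) * 0.01  # small random init
        self.b = 0.0

    def predict(self, emb: np.ndarray) -> float:
        # Sigmoid squashes the score into [0, 1], matching a calibrated
        # quality scale.
        z = float(emb @ self.w + self.b)
        return 1.0 / (1.0 + np.exp(-z))

# One adapter per candidate model; names are illustrative only.
adapters = {"model-small": QualityAdapter(), "model-large": QualityAdapter()}
emb = frozen_encode("Summarize this contract in two sentences.")
scores = {name: a.predict(emb) for name, a in adapters.items()}
```

In a real deployment the encoder pass is shared across all candidates, so per-model cost is just one cheap head evaluation, which helps explain the sub-150 ms latency budget.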
📝 Abstract
Routing incoming queries to the most cost-effective LLM while maintaining response quality poses a fundamental challenge in optimizing performance–cost trade-offs for large-scale commercial systems. We present IPR, a quality-constrained Intelligent Prompt Routing framework that dynamically selects optimal models based on predicted response quality and user-specified tolerance levels. IPR introduces three key innovations: (1) a modular architecture with lightweight quality estimators trained on 1.5M prompts annotated with calibrated quality scores, enabling fine-grained quality prediction across model families; (2) a user-controlled routing mechanism with a tolerance parameter $\tau \in [0,1]$ that provides explicit control over quality–cost trade-offs; and (3) an extensible design using frozen encoders with model-specific adapters, reducing new-model integration from days to hours. To rigorously train and evaluate IPR, we curate an industrial-scale dataset, IPRBench (to be released upon legal approval), a comprehensive benchmark containing 1.5 million examples with response-quality annotations across 11 LLM candidates. Deployed on a major cloud platform, IPR achieves a 43.9% cost reduction while maintaining quality parity with the strongest model in the Claude family, and processes requests with sub-150 ms latency.
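A tolerance-controlled routing rule of the kind described can be sketched as below. This is an assumed reading, not the paper's exact algorithm: route to the cheapest model whose predicted quality is within $\tau$ of the best predicted quality for that prompt, so $\tau = 0$ always picks the top predicted model and larger $\tau$ trades quality headroom for cost. The model names, quality scores, and relative costs are illustrative placeholders.

```python
def route(pred_quality: dict[str, float], cost: dict[str, float], tau: float) -> str:
    """Pick the cheapest model whose predicted quality is within tau of the best.

    pred_quality: per-model predicted quality in [0, 1] for this prompt
    cost:         per-model relative price
    tau:          user-specified quality tolerance in [0, 1]
    """
    best = max(pred_quality.values())
    # Models whose predicted quality loss versus the best is at most tau.
    eligible = [m for m, q in pred_quality.items() if q >= best - tau]
    return min(eligible, key=lambda m: cost[m])

# Illustrative numbers only (not from the paper).
preds = {"haiku": 0.78, "sonnet": 0.86, "opus": 0.90}
costs = {"haiku": 0.25, "sonnet": 1.00, "opus": 5.00}

route(preds, costs, tau=0.0)   # strictest setting -> "opus"
route(preds, costs, tau=0.15)  # tolerant setting  -> "haiku"
```

Under this rule, the reported 43.9% cost reduction at quality parity corresponds to a small $\tau$: many prompts are predicted to be answered nearly as well by a cheaper model, so only the hard prompts reach the most expensive one.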