🤖 AI Summary
This work addresses the query allocation problem that arises when multiple heterogeneous large language models (LLMs) are deployed in parallel for classification: minimize total query cost while satisfying a reliability constraint for every possible ground-truth class. The problem is formulated as an offline optimization task with statewise error constraints and is shown to be NP-hard. We propose a surrogate optimization framework that is both feasibility-preserving and asymptotically tight. By combining a union bound decomposition with Chernoff-type concentration inequalities, we construct a closed-form, multiplicatively separable surrogate for the error constraints, which enables the design of an asymptotic fully polynomial-time approximation scheme (AFPTAS). Theoretical analysis shows that the ratio between the surrogate-optimal cost and the true optimal cost converges to one at an explicit rate as the error tolerances approach zero, and the AFPTAS returns a surrogate-feasible plan within a (1+ε) factor of the surrogate optimum.
📝 Abstract
Deploying multiple large language models (LLMs) in parallel to classify an unknown ground-truth label is a common practice, yet the problem of optimally allocating queries across heterogeneous models remains poorly understood. In this paper, we formulate a robust, offline query-planning problem that minimizes total query cost subject to statewise error constraints which guarantee reliability for every possible ground-truth label. We first establish that this problem is NP-hard via a reduction from the minimum-weight set cover problem. To overcome this intractability, we develop a surrogate by combining a union bound decomposition of the multi-class error into pairwise comparisons with Chernoff-type concentration bounds. The resulting surrogate admits a closed-form, multiplicatively separable expression in the query counts and is guaranteed to be feasibility-preserving. We further show that the surrogate is asymptotically tight at the optimization level: the ratio of surrogate-optimal cost to true optimal cost converges to one as error tolerances shrink, with an explicit rate of $O\left(\log\log(1/\alpha_{\min}) / \log(1/\alpha_{\min})\right)$. Finally, we design an asymptotic fully polynomial-time approximation scheme (AFPTAS) that returns a surrogate-feasible query plan within a $(1+\varepsilon)$ factor of the surrogate optimum.
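To make the phrase "multiplicatively separable in the query counts" concrete, the following is a minimal sketch of one plausible surrogate of this kind, not the paper's exact construction. It assumes each model is described by a known confusion matrix (row = true label, column = output distribution), uses the Bhattacharyya coefficient (the fixed s = 1/2 case of the Chernoff bound) for each pairwise comparison, and sums the pairwise bounds via the union bound. All names (`bhattacharyya`, `surrogate_error`, `cheapest_plan`), the example matrices, costs, and tolerances are hypothetical, and the exhaustive search stands in for the AFPTAS purely for illustration on a tiny instance.

```python
import math
from itertools import product


def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two discrete distributions (Chernoff bound at s = 1/2)."""
    return sum(math.sqrt(pk * qk) for pk, qk in zip(p, q))


def surrogate_error(models, counts, y):
    """Union bound over rival labels of the product of per-model pairwise bounds.

    models[i][y] is the (assumed known) output distribution of model i when the
    true label is y; counts[i] is the number of queries sent to model i.
    """
    num_labels = len(models[0])
    total = 0.0
    for y_alt in range(num_labels):
        if y_alt == y:
            continue
        bound = 1.0
        for conf, n in zip(models, counts):
            # Each model contributes a factor raised to its query count,
            # which is what makes the bound multiplicatively separable.
            bound *= bhattacharyya(conf[y], conf[y_alt]) ** n
        total += bound
    return total


def cheapest_plan(models, costs, alphas, max_queries=20):
    """Exhaustive search for the cheapest surrogate-feasible plan (illustration only)."""
    best_plan, best_cost = None, math.inf
    for counts in product(range(max_queries + 1), repeat=len(models)):
        cost = sum(c * n for c, n in zip(costs, counts))
        if cost >= best_cost:
            continue
        # Statewise constraint: the bound must hold for every true label y.
        if all(surrogate_error(models, counts, y) <= a for y, a in enumerate(alphas)):
            best_plan, best_cost = counts, cost
    return best_plan, best_cost


# Hypothetical instance: two labels, an accurate but expensive model and a
# cheaper, noisier one.
models = [
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.7, 0.3], [0.3, 0.7]],
]
costs = [3, 1]
alphas = [0.05, 0.05]

plan, cost = cheapest_plan(models, costs, alphas)
print(plan, cost)
```

Because the bound upper-bounds the pairwise error of a likelihood-based decision rule, any plan that satisfies it also satisfies the corresponding true error constraint under that rule, which is the sense in which such a surrogate is feasibility-preserving. The exhaustive search grows exponentially in the number of models and is only meant to show the structure of the constraint.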