Online Pandora's Box for Contextual LLM Cascading

📅 2026-06-05
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses how to select the best output among multiple large language model (LLM) APIs efficiently and at low cost as request contexts vary. Focusing on LLM cascading, the work proposes an online contextual Pandora's Box framework: in each round, APIs are queried sequentially, each query reveals an output and incurs a cost, yet only one output is deployed and only its downstream reward is observed. The key innovation is to model context-dependent reservation indices directly, rather than the full distributions of outputs and costs, and to design a learning mechanism inspired by Weitzman's strategy that accommodates this output-mediated feedback structure. By parameterizing the reservation index function and combining generalized method of moments (GMM) estimation with upper confidence bound (UCB) techniques, the approach jointly optimizes querying and selection decisions. Under standard regularity conditions, the proposed policy achieves a dimension-dependent $\widetilde{O}(\sqrt{T})$ cumulative regret bound over $T$ rounds.
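For orientation, the textbook (non-contextual) Pandora's Box problem defines the reservation index $\sigma_i$ of a box with random reward $V_i$ and known opening cost $c_i > 0$ as the solution of

$$\mathbb{E}\big[(V_i - \sigma_i)^{+}\big] = c_i .$$

Weitzman's rule opens boxes in decreasing order of $\sigma_i$ and stops once the best reward seen so far is at least the largest remaining index. The sketch below illustrates only this classical baseline. It assumes a discrete reward distribution, a fixed cost, and rewards that are revealed directly on opening, none of which match the contextual, output-mediated setting of the paper. The names `reservation_index`, `weitzman_round`, and `open_box` are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def reservation_index(values: Sequence[float], probs: Sequence[float],
                      cost: float, tol: float = 1e-9) -> float:
    """Solve E[(V - sigma)^+] = cost by bisection (assumes cost > 0)."""
    def surplus(sigma: float) -> float:
        return sum(p * max(v - sigma, 0.0) for v, p in zip(values, probs))

    lo = min(values) - cost   # surplus(lo) = E[V] - min(V) + cost >= cost
    hi = max(values)          # surplus(hi) = 0 < cost
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if surplus(mid) > cost:   # surplus is decreasing in sigma
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def weitzman_round(indices: Sequence[float],
                   open_box: Callable[[int], float]
                   ) -> Tuple[Optional[int], List[int]]:
    """Run one round of Weitzman's rule; return (chosen box, boxes opened)."""
    order = sorted(range(len(indices)), key=lambda i: indices[i], reverse=True)
    best_value, best_box = float("-inf"), None
    opened: List[int] = []
    for i in order:
        if best_value >= indices[i]:   # stopping rule
            break
        value = open_box(i)            # simplification: reward seen directly
        opened.append(i)
        if value > best_value:
            best_value, best_box = value, i
    return best_box, opened
```

For example, `weitzman_round([0.5, 0.9, 0.2], lambda i: [0.3, 0.6, 1.0][i])` opens only box 1: its realized reward 0.6 already exceeds the next-largest index 0.5, so the search stops even though box 2 would have paid more.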
📝 Abstract
Motivated by Large Language Model (LLM) cascading, we propose an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs. In each period, a decision-maker observes a request context and faces a two-phase decision problem. In the query phase, the decision-maker sequentially queries APIs, where each query reveals a generated output and the decision-maker incurs an (output-dependent) cost. In the selection phase, the decision-maker selects one of the generated outputs to deploy and observes only the downstream reward of the deployed output. This output-mediated feedback structure differs from classical online contextual Pandora's Box models, in which opening a box directly reveals its reward. Rather than estimating the full conditional output and cost distributions of each API, we directly model the reservation index and develop a learning approach for the query phase. Specifically, we impose a parametric structure on the contextual reservation index functions induced by Weitzman's classical policy. Our policy combines generalized method of moments (GMM) type estimation of these reservation indices with UCB-style confidence bounds for both these indices and the shared output-level reward evaluator. Under regularity conditions, we prove that the resulting policy achieves dimension-dependent $\widetilde O(\sqrt T)$ cumulative regret over a horizon of $T$ periods.
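To make the idea of a UCB-style confidence bound on an output-level reward evaluator concrete, here is a minimal sketch under strong assumptions that the abstract does not state: the evaluator is taken to be linear in a feature vector of the deployed output, it is fitted by ridge regression rather than the paper's GMM-type estimator, and the confidence width `beta` is a free tuning constant. The class name `OptimisticLinearEvaluator` and its methods are hypothetical. The only feedback used is the downstream reward of the deployed output, as in the selection phase described above.

```python
import math
from typing import List, Sequence


class OptimisticLinearEvaluator:
    """Ridge regression with a LinUCB-style optimism bonus (illustrative only)."""

    def __init__(self, dim: int, ridge: float = 1.0, beta: float = 1.0) -> None:
        # Inverse of the regularized design matrix A = ridge * I + sum x x^T.
        self.a_inv = [[(1.0 / ridge if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim          # running sum of reward * features
        self.beta = beta              # assumed confidence multiplier

    @staticmethod
    def _matvec(m: Sequence[Sequence[float]], v: Sequence[float]) -> List[float]:
        return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

    def ucb(self, x: Sequence[float]) -> float:
        """Optimistic reward estimate: theta_hat . x + beta * sqrt(x^T A^-1 x)."""
        theta_hat = self._matvec(self.a_inv, self.b)
        a_inv_x = self._matvec(self.a_inv, x)
        mean = sum(t * xi for t, xi in zip(theta_hat, x))
        width = math.sqrt(max(sum(xi * ai for xi, ai in zip(x, a_inv_x)), 0.0))
        return mean + self.beta * width

    def update(self, x: Sequence[float], reward: float) -> None:
        """Add one (deployed-output features, observed reward) pair."""
        # Sherman-Morrison rank-one update of A^-1 (A^-1 is symmetric).
        a_inv_x = self._matvec(self.a_inv, x)
        denom = 1.0 + sum(xi * ai for xi, ai in zip(x, a_inv_x))
        n = len(self.b)
        for i in range(n):
            for j in range(n):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
            self.b[i] += reward * x[i]
```

In a selection phase, one would score each queried output with `ucb(features)` and deploy the highest-scoring one, then call `update` with the observed reward. The bonus term shrinks along feature directions that have been deployed often, which is what drives exploration in this kind of scheme.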
Problem

Research questions and friction points this paper is trying to address.

LLM cascading
online contextual Pandora's Box
adaptive querying
output-mediated feedback
reservation index
Innovation

Methods, ideas, or system contributions that make the work stand out.

contextual Pandora's Box
LLM cascading
reservation index
online learning
output-mediated feedback