🤖 AI Summary
Current large language model (LLM) routing methods use a single generated response as the capability label for each query-model pair. This ignores the inherent stochasticity of LLM generation and introduces noisy supervision signals that make learned routers less reliable. To address this limitation, this work proposes DARS (Distribution-Aware Routing Supervision), a framework that models LLM capabilities as probability distributions rather than point estimates. By considering semantically equivalent formulations of each query and the stochastic generations a model produces for them, DARS captures uncertainty on both the input side and the output side and uses it to construct distribution-aware routing supervision signals. Experiments across diverse tasks show that single-shot labels can be misleading for model selection, while distribution-aware supervision yields more stable labels and improves learned routing behavior.
📝 Abstract
Existing LLM routing methods typically treat a model's single response to a query as its capability label for training routers. However, because LLM generation is inherently stochastic, such single-shot supervision provides only a noisy observation of a query-model pair's behavior rather than a reliable capability estimate. We show that treating a single response as a capability label introduces systematic noise into routing supervision, making learned routing policies less reliable. To address this issue, we propose DARS (Distribution-Aware Routing Supervision), a framework that constructs routing supervision from a distributional view of model behavior. Instead of relying on a single generated response, DARS considers uncertainty from both the input side and the output side, capturing how semantically equivalent query formulations and stochastic generations affect model performance. Based on these distribution-aware observations, DARS builds more reliable supervision signals for routing. Experiments across diverse tasks show that single-shot labels can be misleading for model selection, while distribution-aware supervision provides more stable labels and improves learned routing behavior. Our results suggest that reliable LLM routing should move beyond single-response observations and be grounded in query-level model capability distributions.
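The contrast between single-shot and distribution-aware labels can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify how DARS aggregates observations, so the plain average used here is an assumption, and the names `single_shot_label`, `distribution_aware_label`, and `make_stub_model` are hypothetical. A stub with a fixed success probability stands in for a real LLM plus a correctness checker.

```python
import random
from statistics import mean
from typing import Callable, Sequence

# Hypothetical signature: returns True if one stochastic generation for
# `query` is judged correct. A real setup would call an LLM and a grader.
Judge = Callable[[str, random.Random], bool]


def single_shot_label(is_correct: Judge, query: str, rng: random.Random) -> float:
    """Conventional label: one generation, so the value is either 0.0 or 1.0."""
    return float(is_correct(query, rng))


def distribution_aware_label(
    is_correct: Judge,
    paraphrases: Sequence[str],
    samples_per_paraphrase: int,
    rng: random.Random,
) -> float:
    """Empirical success rate over paraphrases (input side) and repeated
    generations (output side). Averaging is an assumption of this sketch."""
    outcomes = [
        float(is_correct(paraphrase, rng))
        for paraphrase in paraphrases
        for _ in range(samples_per_paraphrase)
    ]
    return mean(outcomes)


def make_stub_model(success_prob: float) -> Judge:
    """Stand-in for an LLM: each generation is correct with a fixed probability."""
    def is_correct(query: str, rng: random.Random) -> bool:
        return rng.random() < success_prob
    return is_correct


if __name__ == "__main__":
    rng = random.Random(0)
    model = make_stub_model(0.6)  # assumed true capability on this query
    paraphrases = ["What is 17 * 3?", "Compute 17 times 3.", "17 multiplied by 3 equals?"]
    print("single-shot:", single_shot_label(model, paraphrases[0], rng))
    print("distribution-aware:", distribution_aware_label(model, paraphrases, 20, rng))
```

Under these assumptions, the single-shot label can only be 0 or 1 and changes from run to run, while the averaged label approaches the stub's underlying success rate as more paraphrases and samples are drawn. This is the sense in which a distributional label is the more stable training target for a router.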