Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity

📅 2026-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of benchmarking large language models, which conventionally means evaluating every candidate model on every test query. The authors propose Synchronized Successive Rejects (SySRs), which augments the classical Successive Rejects algorithm with paired comparisons to exploit the fact that models often respond similarly to the same prompt. Unlike prior attempts to leverage model similarity, the method is hyperparameter-free. Framed as best-arm identification in a multi-armed bandit setting, SySRs adaptively allocates the evaluation budget and comes with performance guarantees that improve with the degree of similarity between the evaluated models. Empirically, SySRs outperforms all baselines in average error rate across 15 standard benchmarks and in the worst-case budget needed to reliably identify the best model.
📝 Abstract
Large Language Models are typically benchmarked by evaluating every model on every test query. For practitioners seeking the best model to deploy, this is often wasteful: if a model clearly performs worse than others, there is no need to precisely estimate its performance. Best-arm identification algorithms can be naturally applied to drastically reduce costs by adaptively allocating evaluation budget. Further, language models often respond similarly to the same prompt, a property previous work has tried to leverage with mixed success. We propose Synchronized Successive Rejects (SySRs), augmenting the classical Successive Rejects algorithm with paired comparisons. Unlike prior attempts to leverage model similarity in best-model identification, our approach is hyperparameter-free and enjoys performance guarantees that improve with the degree of similarity between evaluated models. Empirically, our method outperforms all baselines in terms of average error rate across 15 standard benchmarks, and in terms of worst-case budget for reliably identifying the best model.
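To make the idea concrete, here is a minimal Python sketch of the classical Successive Rejects phase schedule, combined with one plausible reading of "paired comparisons": every surviving model is scored on the same shared queries, so that differences between models are measured query by query. This is an illustration under stated assumptions, not the paper's actual procedure. The function names, the `scores` matrix layout, and the clamping of phase lengths to the number of available queries are all hypothetical choices made for this example.

```python
import math
import random


def sr_phase_lengths(num_models, budget):
    """Cumulative per-model evaluation counts n_1..n_{K-1} used by
    classical Successive Rejects for K models and a total budget."""
    log_bar = 0.5 + sum(1.0 / i for i in range(2, num_models + 1))
    return [
        math.ceil((budget - num_models) / (log_bar * (num_models + 1 - k)))
        for k in range(1, num_models)
    ]


def synchronized_successive_rejects(scores, budget, seed=0):
    """Return the index of the last surviving model.

    scores[m][q] is the score of model m on query q (e.g. 1.0 if correct).
    Assumption for this sketch: all surviving models are evaluated on the
    same queries, in one shared random order.
    """
    num_models = len(scores)
    num_queries = len(scores[0])
    if budget < num_models:
        raise ValueError("budget must be at least the number of models")

    order = list(range(num_queries))
    random.Random(seed).shuffle(order)  # one shared query order for everyone

    active = list(range(num_models))
    totals = [0.0] * num_models
    used = 0  # number of shared queries consumed so far
    for n_k in sr_phase_lengths(num_models, budget):
        n_k = min(n_k, num_queries)  # illustrative clamp: no query is reused
        for q in order[used:n_k]:
            for m in active:
                totals[m] += scores[m][q]
        used = max(used, n_k)
        # Every active model has seen the same `used` queries, so comparing
        # totals is the same as comparing paired per-query differences.
        worst = min(active, key=lambda m: totals[m])
        active.remove(worst)
    return active[0]


# Toy usage: 4 models, 100 queries; model 2 answers 90% of queries correctly.
thresholds = [3, 5, 9, 6]
scores = [[1.0 if q % 10 < t else 0.0 for q in range(100)] for t in thresholds]
best = synchronized_successive_rejects(scores, budget=400)
```

The benefit of pairing comes from a standard identity: for two models with per-query scores X and Y, Var(X - Y) = Var(X) + Var(Y) - 2 Cov(X, Y). When models respond similarly, the covariance is large and the difference is estimated with less noise from the same number of queries.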
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Model Evaluation
Best-arm Identification
Evaluation Cost
Model Similarity
Innovation

Methods, ideas, or system contributions that make the work stand out.

best-arm identification
model similarity
adaptive evaluation
bandit algorithm
large language models