Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization

📅 2026-05-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating large language models on fixed benchmarks is computationally expensive, and existing acceleration methods based on low-rank prediction can introduce bias, leading to incorrect model selection. This work proposes a framework that integrates multi-armed bandits with low-rank matrix factorization, using doubly robust estimation to reduce the number of model evaluations while preserving statistical validity. For the first time, it establishes finite-sample valid confidence intervals in a setting where models are selected adaptively and examples are sampled without replacement, balancing efficiency against accuracy. Empirical results on real-world benchmarks show that the method lowers evaluation costs while reliably identifying the best-performing model.
📝 Abstract
Selecting the best large language model (LLM) for a fixed benchmark is often expensive, since exhaustive evaluation requires running every model on every example. Multi-armed bandit (MAB) algorithms can reduce the number of LLM calls by sequentially selecting the next model-example pair to evaluate, thereby avoiding wasted evaluations on clearly underperforming models. Further savings can be achieved by predicting model scores from the partially observed model-example score matrix using low-rank factorization. However, such predictions are not ground truth: they can be biased and may therefore lead to incorrect identification of the best model. In this work, we propose a principled framework that combines MAB with cheap predicted scores without compromising statistical validity. Specifically, we derive doubly robust estimators of each model's performance that use the low-rank predictions to reduce variance. This enables the construction of valid finite-sample confidence intervals in our setting, where models are selected adaptively and examples are sampled without replacement. Empirical results on real-world benchmarks show that our approach reduces the number of required evaluations, yielding meaningful savings in compute and cost while accurately identifying the best-performing model.
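The variance-reduction idea in the abstract can be illustrated with a generic difference-style estimator: average the cheap predictions over every example, then correct that average using the residuals on the examples that were actually evaluated. The sketch below is an assumption-laden illustration, not the paper's estimator. It assumes uniform sampling without replacement for a single model, it does not reproduce the adaptive bandit selection or the confidence intervals, and the names `doubly_robust_mean`, `true_scores`, and `predicted` are hypothetical.

```python
import random


def doubly_robust_mean(predicted, observed):
    """Estimate one model's mean benchmark score.

    predicted: list of predicted scores, one per benchmark example
               (e.g. from a low-rank factorization; assumed given here).
    observed:  dict mapping example index -> true score for the subset
               of examples that were actually evaluated.
    """
    if not observed:
        raise ValueError("need at least one evaluated example")
    n = len(predicted)
    # Cheap baseline: mean of the predictions over the whole benchmark.
    baseline = sum(predicted) / n
    # Bias correction: mean residual on the evaluated examples only.
    correction = sum(y - predicted[i] for i, y in observed.items()) / len(observed)
    return baseline + correction


# Toy data (made up for illustration): 8 examples with 0/1 correctness.
true_scores = [1, 0, 1, 1, 0, 1, 0, 1]
predicted = [0.9, 0.2, 0.7, 0.8, 0.4, 0.6, 0.3, 0.9]

# Evaluate only 3 of the 8 examples, sampled without replacement.
rng = random.Random(0)
sampled = rng.sample(range(len(true_scores)), k=3)
observed = {i: true_scores[i] for i in sampled}

estimate = doubly_robust_mean(predicted, observed)
print(f"estimate from 3 of 8 evaluations: {estimate:.3f}")
```

Under uniform sampling the correction term has expectation equal to the average prediction error, so the estimate stays unbiased even when the predictions are biased. When the predictions are accurate the residuals are small, which is where the variance reduction comes from.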
Problem

Research questions and friction points this paper is trying to address.

large language model
model evaluation
low-rank factorization
statistical validity
best-model identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

low-rank factorization
multi-armed bandit
doubly robust estimation
LLM evaluation
statistical validity