Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Evaluating large language models on fixed benchmarks is computationally expensive, and existing acceleration methods based on low-rank prediction can introduce bias, leading to incorrect model selection. This work proposes a framework that combines multi-armed bandits with low-rank matrix factorization, using doubly robust estimation to reduce the number of model evaluations while preserving statistical validity. It establishes finite-sample valid confidence intervals under adaptive model selection with sampling without replacement, balancing efficiency and accuracy. Empirical results on real-world benchmarks show that the method lowers evaluation costs while reliably identifying the best-performing model.

📝 Abstract

Selecting the best large language model (LLM) for a fixed benchmark is often expensive, since exhaustive evaluation requires running every model on every example. Multi-armed bandit (MAB) algorithms can reduce the number of LLM calls by sequentially selecting the next model-example pair to evaluate, thereby avoiding wasted evaluations on clearly underperforming models. Further savings can be achieved by predicting model scores from the partially observed model-example score matrix using low-rank factorization. However, such predictions are not ground truth: they can be biased and may therefore lead to incorrect identification of the best model. In this work, we propose a principled framework that combines MAB with cheap predicted scores without compromising statistical validity. Specifically, we derive doubly robust estimators of each model's performance that use the low-rank predictions to reduce variance. This enables the construction of valid finite-sample confidence intervals in our setting, where models are selected adaptively and examples are sampled without replacement. Empirical results on real-world benchmarks show that our approach reduces the number of required evaluations, yielding meaningful savings in compute and cost while accurately identifying the best-performing model.
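
The abstract does not spell out the estimator, so the following is only a minimal sketch of a generic doubly robust (difference) estimator of one model's mean score. It is not the paper's exact construction. It assumes the evaluated examples are drawn uniformly at random without replacement, and the function name and arguments are hypothetical. The idea is to average the cheap predictions over the whole benchmark, then correct that average with the residuals measured on the examples that were actually evaluated.

```python
import numpy as np


def doubly_robust_mean(pred, observed_idx, observed_scores):
    """Estimate one model's mean benchmark score (illustrative sketch).

    pred            : predicted scores for all n examples, e.g. from a
                      low-rank fit of the partially observed score matrix.
    observed_idx    : indices of the examples actually evaluated; assumed
                      sampled uniformly without replacement.
    observed_scores : true scores on those examples, in the same order.
    """
    pred = np.asarray(pred, dtype=float)
    observed_idx = np.asarray(observed_idx, dtype=int)
    observed_scores = np.asarray(observed_scores, dtype=float)

    # Cheap term: average prediction over the full benchmark.
    plug_in = pred.mean()
    # Correction term: average residual on the evaluated examples.
    # This offsets systematic bias in the predictions.
    correction = (observed_scores - pred[observed_idx]).mean()
    return plug_in + correction


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 1000
    truth = rng.binomial(1, 0.7, size=n).astype(float)    # synthetic scores
    biased_pred = 0.8 * truth + 0.25                       # deliberately biased
    idx = rng.choice(n, size=100, replace=False)           # evaluated subset

    print("true mean       :", truth.mean())
    print("predictions only:", biased_pred.mean())
    print("sample mean     :", truth[idx].mean())
    print("doubly robust   :", doubly_robust_mean(biased_pred, idx, truth[idx]))
```

Under these assumptions the correction term has the same expectation as the bias of the predictions, so the estimate stays unbiased even when the predictions are poor, and its variance shrinks as the predictions improve. The sketch covers only the point estimate. It does not construct the finite-sample confidence intervals under adaptive selection that the paper describes.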

Problem

Research questions and friction points this paper is trying to address.

large language model

model evaluation

low-rank factorization

statistical validity

best-model identification

Innovation

Methods, ideas, or system contributions that make the work stand out.

low-rank factorization

multi-armed bandit

doubly robust estimation