🤖 AI Summary
Comprehensive evaluation of large language models across numerous benchmarks is prohibitively expensive, making it challenging to obtain statistically reliable performance estimates under limited query budgets. This work casts benchmarking as a finite-population inference problem and proposes Factorized Active Querying (FAQ), a method that leverages a Bayesian factor model to exploit historical task structure, combines variance reduction with active-learning question selection, and introduces a Proactive Active Inference framework to preserve frequentist coverage. FAQ substantially improves sample efficiency with negligible computational overhead, achieving up to a fivefold effective sample size gain over strong baselines on two benchmark suites; equivalently, it attains the same confidence interval width with only one-fifth of the query budget.
📝 Abstract
Exhaustively evaluating many large language models (LLMs) on a large suite of benchmarks is expensive. We cast benchmarking as finite-population inference and, under a fixed query budget, seek tight confidence intervals (CIs) for model accuracy with valid frequentist coverage. We propose Factorized Active Querying (FAQ), which (a) leverages historical information through a Bayesian factor model; (b) adaptively selects questions using a hybrid variance-reduction/active-learning sampling policy; and (c) maintains validity through Proactive Active Inference, a finite-population extension of active inference (Zrnic & Candès, 2024) that enables direct question selection while preserving coverage. With negligible overhead, FAQ delivers up to $5\times$ effective sample size gains over strong baselines on two benchmark suites, across varying historical-data missingness levels: this means it matches the CI width of uniform sampling while using up to $5\times$ fewer queries. We release our source code and our curated datasets to support reproducible evaluation and future research.
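As a rough illustration (not code from the paper), the equivalence between effective sample size gain and CI width stated above follows from the usual $1/\sqrt{n}$ scaling of a mean's CI width: a method whose interval is $k$ times narrower at the same budget behaves as if it had $k^2$ times the samples. A minimal sketch, with hypothetical widths chosen only for the example:

```python
import math

def ess_gain(ci_width_baseline: float, ci_width_method: float) -> float:
    """Effective sample size gain implied by two CI widths at equal budget.

    Assumes CI width scales as 1/sqrt(n), so a method whose CI is k times
    narrower than the baseline acts as if it had k**2 times the samples.
    """
    return (ci_width_baseline / ci_width_method) ** 2

# A CI that is sqrt(5) times narrower than uniform sampling at the same
# budget implies a 5x effective sample size gain -- i.e., matching the
# baseline's width would need only 1/5 of the queries.
gain = ess_gain(0.10, 0.10 / math.sqrt(5))
print(gain)  # → 5.0 (up to floating-point rounding)
```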