🤖 AI Summary
LLM benchmarking is frequently compromised by pretraining data leakage, which inflates scores, distorts assessments of generalization, and undermines the reliability of cross-model comparisons. To address this, we propose ArenaBencher, a model-agnostic framework for automatic benchmark evolution. It iteratively generates more diagnostic, diverse, and challenging test cases by comparing model performance on existing benchmarks, inferring capability profiles, identifying shared weaknesses, validating candidates via LLM-based adjudication, and aggregating feedback. Evaluated on mathematical reasoning, commonsense reasoning, and safety tasks, ArenaBencher significantly improves benchmark fairness, discriminative power, and capacity to expose failure modes. Unlike static benchmarks, it establishes a sustainable, dynamic paradigm for LLM evaluation in which benchmarks evolve alongside model capabilities, preserving long-term validity and robustness in comparative assessment.
📝 Abstract
Benchmarks are central to measuring the capabilities of large language models and guiding model development, yet widespread data leakage from pretraining corpora undermines their validity. Models can match memorized content rather than demonstrate true generalization, which inflates scores, distorts cross-model comparisons, and misrepresents progress. We introduce ArenaBencher, a model-agnostic framework for automatic benchmark evolution that updates test cases while preserving comparability. Given an existing benchmark and a diverse pool of models to be evaluated, ArenaBencher infers the core ability of each test case, generates candidate question-answer pairs that preserve the original objective, verifies correctness and intent with an LLM as a judge, and aggregates feedback from multiple models to select candidates that expose shared weaknesses. The process runs iteratively with in-context demonstrations that steer generation toward more challenging and diagnostic cases. We apply ArenaBencher to math problem solving, commonsense reasoning, and safety domains and show that it produces verified, diverse, and fair updates that uncover new failure modes, increase difficulty while preserving test objective alignment, and improve model separability. The framework provides a scalable path to continuously evolve benchmarks in step with the rapid progress of foundation models.
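The evolution loop described in the abstract (infer the test objective, generate candidates, verify with an LLM judge, aggregate multi-model feedback, and feed hard cases back as in-context demonstrations) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the models, judge, and generator below are deterministic stubs standing in for LLM calls, and all function and field names (`evolve_case`, `toy_generate`, `question`/`answer` keys) are hypothetical.

```python
# Hypothetical sketch of an ArenaBencher-style evolution loop.
# Stub "evaluee" models: both fail on questions containing "twice",
# standing in for a shared weakness across the model pool.
def model_a(q):
    return "wrong" if "twice" in q else "right"

def model_b(q):
    return "wrong" if "twice" in q else "right"

def judge(candidate, objective):
    # LLM-as-judge stand-in: accept a candidate only if it still
    # targets the original test objective (here, a keyword check).
    return objective in candidate["question"]

def toy_generate(case, demos):
    # Generator stand-in: produce rephrasings of the current case
    # that preserve its question-answer objective.
    q = case["question"]
    return [{"question": q + " twice", "answer": case["answer"]},
            {"question": q + " again", "answer": case["answer"]}]

def evolve_case(seed, objective, models, generate, rounds=2):
    demos = []           # in-context demonstrations of accepted hard cases
    best = seed
    for _ in range(rounds):
        candidates = generate(best, demos)
        # Verification step: keep only candidates the judge accepts.
        verified = [c for c in candidates if judge(c, objective)]
        if not verified:
            continue
        # Feedback aggregation: score each candidate by how many
        # models in the pool answer it incorrectly.
        scored = [(sum(m(c["question"]) == "wrong" for m in models), c)
                  for c in verified]
        scored.sort(key=lambda t: t[0], reverse=True)
        fails, cand = scored[0]
        if fails > 0:
            demos.append(cand)   # steer the next round toward harder cases
            best = cand
    return best

seed = {"question": "add 2 and 3", "answer": "5"}
evolved = evolve_case(seed, "add", [model_a, model_b], toy_generate)
```

In this toy run the loop converges on a variant both stub models fail, while the judge guarantees every accepted variant still tests the seed's objective; the real framework replaces each stub with LLM calls.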