YourBench: Easy Custom Evaluation Sets for Everyone

📅 2025-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address bottlenecks in LLM evaluation, including static benchmark saturation, data contamination, and the high cost and slow pace of human annotation, this paper proposes a lightweight, annotation-free framework for dynamic benchmark generation. Methodologically, it combines LLM-driven automated question-answer generation, citation-based provenance verification, isolation from models' posterior parametric knowledge, and Spearman rank-order consistency validation, coupled with an evaluation paradigm that pairs algorithmic checks with human assessment across model scales (3B–671B parameters). Key contributions: (1) Tempora-0325, an open-source temporal evaluation set of more than 7K documents published after March 2025; (2) YourBench, an open-source benchmarking library; (3) more than 150K high-quality QA pairs with full reasoning traces. Empirical validation on MMLU subsets costs under $15 in total inference while exactly preserving the original model rankings (Spearman ρ = 1), improving the domain customization, trustworthiness, and timeliness of LLM assessment.
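The rank-consistency check reported above (Spearman ρ = 1) can be reproduced with the closed-form Spearman formula for tie-free rankings. The sketch below uses hypothetical rank vectors, not the paper's actual model scores:

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation for two full rankings without ties:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(ranks_a)
    d_sq = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_sq / (n * (n ** 2 - 1))

# Hypothetical ranks of the same five models on an original MMLU
# subset and on a YourBench-generated replica (1 = best).
original = [1, 2, 3, 4, 5]
replica = [1, 2, 3, 4, 5]
print(spearman_rho(original, replica))  # identical order -> 1.0
```

A replica that preserves the relative ordering of all models yields ρ = 1 even if absolute accuracies shift, which is exactly the property the paper validates.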

📝 Abstract
Evaluating large language models (LLMs) effectively remains a critical bottleneck, as traditional static benchmarks suffer from saturation and contamination, while human evaluations are costly and slow. This hinders timely or domain-specific assessment, crucial for real-world applications. We introduce YourBench, a novel, open-source framework that addresses these limitations by enabling dynamic, automated generation of reliable, up-to-date, and domain-tailored benchmarks cheaply and without manual annotation, directly from user-provided documents. We demonstrate its efficacy by replicating 7 diverse MMLU subsets using minimal source text, achieving this for under 15 USD in total inference costs while perfectly preserving the relative model performance rankings (Spearman Rho = 1) observed on the original benchmark. To ensure that YourBench generates data grounded in provided input instead of relying on posterior parametric knowledge in models, we also introduce Tempora-0325, a novel dataset of over 7K diverse documents, published exclusively after March 2025. Our comprehensive analysis spans 26 SoTA models from 7 major families across varying scales (3-671B parameters) to validate the quality of generated evaluations through rigorous algorithmic checks (e.g., citation grounding) and human assessments. We release the YourBench library, the Tempora-0325 dataset, 150k+ question answer pairs based on Tempora and all evaluation and inference traces to facilitate reproducible research and empower the community to generate bespoke benchmarks on demand, fostering more relevant and trustworthy LLM evaluation.
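The abstract's "rigorous algorithmic checks (e.g., citation grounding)" can be illustrated with a minimal verbatim-match sketch. The function name and exact matching criterion here are assumptions for illustration; the paper's actual pipeline may use fuzzy matching or LLM judges:

```python
def citations_grounded(source_text: str, citations: list[str]) -> bool:
    """Check that every quoted citation appears verbatim in the source
    document, after collapsing whitespace. A minimal grounding check."""
    def normalize(s: str) -> str:
        return " ".join(s.split())

    doc = normalize(source_text)
    return all(normalize(c) in doc for c in citations)

doc = "YourBench generates benchmarks directly from user-provided documents."
print(citations_grounded(doc, ["directly from user-provided documents"]))  # True
print(citations_grounded(doc, ["static benchmark saturation"]))            # False
```

Rejecting QA pairs whose cited spans cannot be located in the source is one way to ensure answers derive from the provided documents rather than a model's parametric knowledge.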
Problem

Research questions and friction points this paper is trying to address.

Static benchmarks saturate and become contaminated over time
Human annotation for evaluation is costly and slow
How to generate reliable, domain-specific, up-to-date benchmarks automatically and cheaply
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic, automated benchmark generation directly from user-provided documents
Low-cost, annotation-free domain-specific evaluation (under 15 USD per MMLU replica)
Tempora-0325 dataset of post-March-2025 documents grounds answers in the input rather than parametric knowledge