Auditing LLM Benchmarks with Item Response Theory

📅 2026-05-28

🤖 AI Summary
This study addresses label problems in current large language model (LLM) benchmarks: labels are frozen at release and propagate silently, errors included, into downstream benchmarks. The authors introduce an indicator based on item response theory (IRT) to audit benchmark labels. Using responses from 114 models across seven multiple-choice and preference benchmarks, the indicator surfaces likely mislabels at 95% precision among the top 200 flagged examples, outperforming a supervised classifier. The detected errors trace to mechanical labeling heuristics, annotation mistakes inherited unchanged from source datasets, and fundamentally ambiguous items that have no defensible single label. The same model fit indicates that reward models specialize in stylistic preference rather than factual knowledge. It also identifies one frontier reward model that agrees with detected mislabels 78% of the time, versus 38% for its peers, a pattern consistent with benchmark contamination or benchmark-specific over-optimization.
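The "precision in the top 200" figure is a precision-at-k measurement: rank items by a suspicion score, inspect the k highest, and report the fraction that are confirmed mislabels. A minimal sketch follows; the function name, the scores, and the confirmation flags are all hypothetical illustrations and are not taken from the paper.

```python
def precision_at_k(scores, is_mislabeled, k):
    """Fraction of the k highest-scoring items that are confirmed mislabels."""
    if k <= 0 or not scores:
        raise ValueError("need k > 0 and at least one scored item")
    # Rank item indices from most to least suspicious.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = ranked[:k]
    return sum(1 for i in top if is_mislabeled[i]) / len(top)


# Hypothetical suspicion scores and manual audit outcomes for five items.
scores = [0.9, 0.1, 0.8, 0.4, 0.7]
confirmed = [True, False, True, False, False]
p_at_3 = precision_at_k(scores, confirmed, 3)
```

In this toy case two of the three top-ranked items are confirmed mislabels, so `p_at_3` is 2/3.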
📝 Abstract
LLM benchmark labels are frozen at release and silently propagated into downstream benchmarks, errors and all. We introduce an Item Response Theory-based indicator that surfaces likely mislabels at 95% precision in the top 200 examples across seven preference and multiple-choice benchmarks using responses from 114 models, outperforming a supervised classifier. We trace these errors to mechanical labeling heuristics, upstream annotation mistakes inherited unchanged from source datasets, and fundamentally ambiguous items without a defensible single label. The same model fit reveals that reward models specialize in stylistic preference rather than factual knowledge, and identifies one frontier reward model that agrees with detected mislabels at 78% accuracy versus 38% for its peers, consistent with benchmark contamination or benchmark-specific over-optimization.
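The abstract does not define the indicator itself. As a rough illustration of how an IRT fit can expose a bad label, the sketch below fits a two-parameter logistic (2PL) model to a models-by-items agreement matrix by gradient ascent on a lightly penalised joint likelihood, then flags items whose fitted discrimination is negative, meaning stronger models agree with the label less often than weaker ones. The negative-discrimination rule, the function names, the hyperparameters, and the synthetic data are all assumptions made for this example, not the authors' method.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fit_2pl(responses, steps=500, lr=0.05, l2=0.01):
    """Fit P(agree) = sigmoid(slope[i] * theta[j] + intercept[i]).

    responses[j][i] is 1 if model j agrees with the label of item i, else 0.
    """
    n_models, n_items = len(responses), len(responses[0])
    # Start abilities from centred accuracy so higher theta means "more accurate overall".
    acc = [sum(row) / n_items for row in responses]
    mean_acc = sum(acc) / n_models
    theta = [4.0 * (a - mean_acc) for a in acc]
    slope = [1.0] * n_items
    intercept = [0.0] * n_items
    for _ in range(steps):
        # Gradients of the log-likelihood plus a small L2 penalty.
        g_theta = [-l2 * t for t in theta]
        g_slope = [-l2 * s for s in slope]
        g_int = [-l2 * c for c in intercept]
        for j in range(n_models):
            for i in range(n_items):
                err = responses[j][i] - sigmoid(slope[i] * theta[j] + intercept[i])
                g_theta[j] += err * slope[i]
                g_slope[i] += err * theta[j]
                g_int[i] += err
        theta = [t + lr * g / n_items for t, g in zip(theta, g_theta)]
        slope = [s + lr * g / n_models for s, g in zip(slope, g_slope)]
        intercept = [c + lr * g / n_models for c, g in zip(intercept, g_int)]
    return theta, slope, intercept


# Synthetic data: 30 models with evenly spaced abilities, 10 items with evenly
# spaced difficulties. A model answers correctly iff its ability exceeds the difficulty.
true_ability = [-2.0 + 4.0 * j / 29 for j in range(30)]
difficulty = [-1.5 + 3.0 * i / 9 for i in range(10)]
responses = [[1 if t > b else 0 for b in difficulty] for t in true_ability]

# Simulate a wrong gold label on item 3: agreement with the label is inverted.
for row in responses:
    row[3] = 1 - row[3]

theta, slope, intercept = fit_2pl(responses)
flagged = [i for i, s in enumerate(slope) if s < 0]
```

On this synthetic matrix only the item with the inverted label should end up with a negative slope. Real benchmark data is noisier, so a practical audit would rank items by a continuous score and review the top of the list by hand rather than rely on a sign test.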
Problem

Research questions and friction points this paper is trying to address.

LLM benchmarks
label errors
Item Response Theory
benchmark contamination
ambiguous items
Innovation

Methods, ideas, or system contributions that make the work stand out.

Item Response Theory
LLM benchmark auditing
mislabel detection
reward model analysis
benchmark contamination