Auditing LLM Benchmarks with Item Response Theory

📅 2026-05-28

🤖 AI Summary

This study addresses label errors in large language model (LLM) benchmarks, which are frozen at release and silently propagated into downstream benchmarks. The authors introduce an Item Response Theory (IRT)-based indicator for auditing benchmark labels. Using responses from 114 models across seven multiple-choice and preference benchmarks, the indicator surfaces likely mislabels at 95% precision in the top 200 examples, outperforming a supervised classifier. The detected errors trace to mechanical labeling heuristics, annotation mistakes inherited unchanged from source datasets, and fundamentally ambiguous items with no defensible single label. The same model fit indicates that reward models specialize in stylistic preference rather than factual knowledge, and it identifies one frontier reward model that agrees with detected mislabels 78% of the time versus 38% for its peers, a pattern consistent with benchmark contamination or benchmark-specific over-optimization.
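To make the "95% precision in the top 200" figure concrete, here is a minimal sketch of precision at k. The function name `precision_at_k` and its inputs are illustrative assumptions, not taken from the paper: `scores` stands for any suspicion score per benchmark item, and `is_mislabel` for the result of a manual check of each item.

```python
def precision_at_k(scores, is_mislabel, k):
    """Fraction of the k highest-scored items that are confirmed mislabels.

    scores      -- hypothetical suspicion score per item (higher = more suspect)
    is_mislabel -- True/False per item from a manual audit (assumed available)
    """
    if k <= 0 or k > len(scores):
        raise ValueError("k must be between 1 and the number of items")
    # Rank item indices from most to least suspicious.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = order[:k]
    return sum(1 for i in top if is_mislabel[i]) / k


# Toy illustration: 200 ranked items of which the first 190 are confirmed mislabels.
toy_scores = list(range(200, 0, -1))
toy_audit = [i < 190 for i in range(200)]
toy_precision = precision_at_k(toy_scores, toy_audit, k=200)
```

Under this reading, 95% precision at k = 200 corresponds to 190 of the 200 top-ranked items being confirmed as mislabeled.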

📝 Abstract

LLM benchmark labels are frozen at release and silently propagated into downstream benchmarks, errors and all. We introduce an Item Response Theory-based indicator that surfaces likely mislabels at 95% precision in the top 200 examples across seven preference and multiple-choice benchmarks using responses from 114 models, outperforming a supervised classifier. We trace these errors to mechanical labeling heuristics, upstream annotation mistakes inherited unchanged from source datasets, and fundamentally ambiguous items without a defensible single label. The same model fit reveals that reward models specialize in stylistic preference rather than factual knowledge, and identifies one frontier reward model that agrees with detected mislabels at 78% accuracy versus 38% for its peers, consistent with benchmark contamination or benchmark-specific over-optimization.
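The abstract does not spell out the indicator, so the following is only a sketch of how an IRT fit can flag suspect labels, not the authors' method. It assumes a standard two-parameter logistic (2PL) model in slope-intercept form, P(r_ij = 1) = sigmoid(a_j * theta_i + d_j), fitted by plain gradient ascent, and it uses negative item discrimination a_j as a stand-in indicator: an item that stronger models "fail" more often than weaker ones is a candidate mislabel. The names `fit_2pl` and `rank_suspects`, the optimizer settings, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def fit_2pl(responses, steps=500, lr=0.5):
    """Fit P(r_ij = 1) = sigmoid(a_j * theta_i + d_j) by gradient ascent.

    responses -- (n_models, n_items) array of 0/1, 1 = model agrees with the label
    Returns abilities theta, discriminations a, and intercepts d.
    """
    n_models, n_items = responses.shape
    # Start abilities from mean agreement so that higher theta means a stronger model.
    theta = responses.mean(axis=1)
    theta = (theta - theta.mean()) / (theta.std() + 1e-8)
    a = np.ones(n_items)
    d = np.zeros(n_items)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(np.outer(theta, a) + d)))
        resid = responses - p  # gradient of the Bernoulli log-likelihood w.r.t. the logit
        grad_theta = resid @ a / n_items
        grad_a = resid.T @ theta / n_models
        grad_d = resid.mean(axis=0)
        theta = theta + lr * grad_theta
        a = a + lr * grad_a
        d = d + lr * grad_d
        # Re-standardize abilities each step to fix the scale of the latent trait.
        theta = (theta - theta.mean()) / (theta.std() + 1e-8)
    return theta, a, d


def rank_suspects(a, k):
    """Indices of the k items with the lowest (most negative) discrimination."""
    return np.argsort(a)[:k]


# Synthetic check: 114 simulated models, 80 items, the last 8 graded against a wrong label.
rng = np.random.default_rng(0)
n_models, n_items = 114, 80
true_theta = rng.normal(size=n_models)
true_a = rng.uniform(1.0, 2.0, size=n_items)
true_b = rng.normal(size=n_items)
prob = 1.0 / (1.0 + np.exp(-true_a * (true_theta[:, None] - true_b)))
responses = (rng.random((n_models, n_items)) < prob).astype(float)
flipped = np.arange(n_items - 8, n_items)
responses[:, flipped] = 1.0 - responses[:, flipped]  # binary item scored against the wrong key

theta, a, d = fit_2pl(responses)
suspects = rank_suspects(a, k=8)
```

On this synthetic data, flipping a binary label inverts the relation between ability and agreement, which is what pushes the fitted discrimination below zero. Real benchmarks with more than two answer options or noisy annotation would need a more careful indicator, and the paper's own metric may differ from this heuristic.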

Problem

Research questions and friction points this paper is trying to address.

LLM benchmarks

label errors

Item Response Theory

benchmark contamination

ambiguous items

Innovation

Methods, ideas, or system contributions that make the work stand out.

Item Response Theory

LLM benchmark auditing

mislabel detection

reward model analysis

benchmark contamination
