When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
LLM judge-based benchmarks suffer from ill-defined objectives and unverifiable structures, leading to high-confidence yet unreliable model rankings. Method: We propose a dual-diagnostic framework, "schematic adherence" and "psychometric validity", to quantify how far judges deviate from their stated scoring criteria and how much irreducible uncertainty is internal to a benchmark run. Using internal consistency and discriminant validity metrics, we re-evaluate prominent benchmarks (e.g., Arena-Hard Auto) and analyze how ELO-style aggregation obscures judgment uncertainty. Contribution/Results: Empirical analysis reveals pervasive schema incoherence and factor collapse across mainstream LLM judges: unexplained variance exceeds 90% for some judges (e.g., DeepSeek-R1-32B), and inter-dimension correlations reach 0.93 or higher, severely undermining benchmark validity. This work provides the first systematic evidence of structural failure in LLM judge-based evaluation and introduces a reproducible, psychometrically grounded paradigm for diagnosing benchmark validity.
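
To make the schematic adherence diagnostic concrete, here is a minimal sketch, assuming a linear fit of the judge's overall verdict on its own per-criterion rubric scores and reading 1 - R² as unexplained variance. The function names and toy data are ours, not the paper's released implementation.

```python
# Hypothetical sketch of a "schematic adherence" diagnostic: regress a judge's
# overall verdict on its per-criterion rubric scores and treat 1 - R^2 as the
# share of the verdict the rubric does NOT explain. Assumed construction, not
# the paper's exact method.
import numpy as np

def unexplained_variance(criterion_scores: np.ndarray, overall: np.ndarray) -> float:
    """criterion_scores: (n_verdicts, n_criteria); overall: (n_verdicts,)."""
    X = np.column_stack([np.ones(len(overall)), criterion_scores])  # add intercept
    beta, *_ = np.linalg.lstsq(X, overall, rcond=None)              # fit rubric -> verdict
    residuals = overall - X @ beta
    r_squared = 1.0 - residuals.var() / overall.var()
    return 1.0 - r_squared

# Toy data: 200 verdicts on 4 rubric criteria, where the overall verdict is
# driven mostly by noise rather than the rubric (mirroring the >90% finding).
rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 4))
overall = scores @ np.array([0.1, 0.1, 0.1, 0.1]) + rng.normal(size=200)
print(f"unexplained variance: {unexplained_variance(scores, overall):.2f}")
```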

📝 Abstract
LLM-judged benchmarks are increasingly used to evaluate complex model behaviors, yet their design introduces failure modes absent in conventional ground-truth-based benchmarks. We argue that without tight objectives and verifiable constructions, benchmarks can produce high-confidence rankings that are in fact largely noise. We introduce two mechanisms to diagnose these issues. Schematic adherence quantifies how much of a judge's overall verdict is explained by the explicit evaluation schema, revealing unexplained variance when judges deviate from their own rubric. Psychometric validity aggregates internal consistency and discriminant validity signals to quantify the irreducible uncertainty in any benchmarking run. Applying these tools to Arena-Hard Auto, we find severe schema incoherence and factor collapse across popular judges: for example, unexplained variance exceeding 90 percent for DeepSeek-R1-32B and factor correlations above 0.93 for most criteria. We also show that the ELO-style aggregation used by Arena-Hard Auto collapses and masks genuine ranking uncertainty. Our results highlight design failures that undermine validity and offer actionable principles for building better-scoped, reliability-aware LLM-judged benchmarks. We release our code at https://anonymous.4open.science/r/judgment-to-noise-947D/README.md
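
The abstract's point about ELO-style aggregation can be illustrated with a small experiment. The sketch below is our construction, not the released code: it fits Elo-style ratings to noisy pairwise verdicts and bootstraps the battles, and the wide, overlapping rating intervals are exactly what a point-estimate leaderboard hides.

```python
# Minimal sketch (our assumption of the aggregation, not the paper's code) of
# why a single Elo-style point estimate can hide ranking uncertainty:
# bootstrap-resampling the pairwise verdicts yields rating intervals that
# overlap even when the leaderboard shows a strict order.
import numpy as np

def elo_ratings(battles, n_models, k=4.0, base=400.0, init=1000.0):
    """battles: list of (winner_idx, loser_idx) pairwise judge verdicts."""
    r = np.full(n_models, init)
    for w, l in battles:
        expected_w = 1.0 / (1.0 + 10 ** ((r[l] - r[w]) / base))
        r[w] += k * (1.0 - expected_w)
        r[l] -= k * (1.0 - expected_w)
    return r

rng = np.random.default_rng(0)
# Toy data: 3 models of nearly identical true strength, noisy judge verdicts.
true_p = np.array([[0.50, 0.52, 0.54],
                   [0.48, 0.50, 0.52],
                   [0.46, 0.48, 0.50]])
battles = []
for _ in range(1000):
    a, b = rng.choice(3, size=2, replace=False)
    battles.append((a, b) if rng.random() < true_p[a, b] else (b, a))

point = elo_ratings(battles, 3)
boot = np.array([
    elo_ratings([battles[i] for i in rng.integers(len(battles), size=len(battles))], 3)
    for _ in range(200)
])
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
for m in range(3):
    print(f"model {m}: Elo {point[m]:.0f}  95% CI [{lo[m]:.0f}, {hi[m]:.0f}]")
```

Reporting only `point` reproduces the collapse the paper describes; reporting the intervals restores the ranking uncertainty.
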
Problem

Research questions and friction points this paper is trying to address.

LLM-judged benchmarks introduce failure modes absent in ground-truth benchmarks
Benchmark rankings produce high-confidence results that are largely noise
Current designs undermine validity through schema incoherence and factor collapse
Innovation

Methods, ideas, or system contributions that make the work stand out.

Schematic adherence quantifies unexplained variance in judge verdicts
Psychometric validity measures internal consistency and discriminant validity (see the sketch after this list)
Tools reveal schema incoherence and factor collapse in benchmarks
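
A hedged sketch of the two psychometric signals named above: Cronbach's alpha for internal consistency and inter-dimension correlations for discriminant validity. High alpha combined with near-1.0 cross-dimension correlations reproduces the "factor collapse" pattern the paper reports; the function names and toy data are assumptions, not the paper's implementation.

```python
# Two standard psychometric checks, assumed as stand-ins for the paper's
# internal-consistency and discriminant-validity metrics.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: (n_responses, n_items) scores meant to measure one construct."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_vars / total_var)

def max_interdim_correlation(dims: np.ndarray) -> float:
    """dims: (n_responses, n_dimensions); returns the largest off-diagonal |r|.
    Values near 1.0 mean the 'distinct' criteria measure a single factor."""
    corr = np.corrcoef(dims, rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return np.abs(off_diag).max()

rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 1))                # one latent factor
dims = shared + 0.15 * rng.normal(size=(500, 4))  # 4 nominally distinct criteria
print(f"alpha: {cronbach_alpha(dims):.2f}")                      # high consistency
print(f"max inter-dimension r: {max_interdim_correlation(dims):.2f}")  # ~0.9+: collapse
```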