When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
LLM judge-based benchmarks suffer from ill-defined objectives and unverifiable structures, leading to high-confidence yet unreliable model rankings. Method: We propose a dual-diagnostic framework, "schematic adherence" and "psychometric validity", to quantify how far judges deviate from their stated scoring criteria and how much irreducible uncertainty is internal to a benchmark run. Using internal consistency and discriminant validity metrics, we re-evaluate prominent benchmarks (e.g., Arena-Hard Auto) and analyze how ELO-style aggregation obscures judgment uncertainty. Contribution/Results: Empirical analysis reveals pervasive schema incoherence and factor collapse across mainstream LLM judges: unexplained variance exceeds 90% for some judges (e.g., DeepSeek-R1-32B), and inter-dimension correlations reach 0.93 or higher, severely undermining benchmark validity. This work provides the first systematic evidence of structural failure in LLM judge-based evaluation and introduces a reproducible, psychometrically grounded paradigm for diagnosing benchmark validity.
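
To make the schematic adherence diagnostic concrete, here is a minimal sketch, assuming a linear fit of the judge's overall verdict on its own per-criterion rubric scores and reading 1 - R² as unexplained variance. The function names and toy data are ours, not the paper's released implementation.

```python
# Hypothetical sketch of a "schematic adherence" diagnostic: regress a judge's
# overall verdict on its per-criterion rubric scores and treat 1 - R^2 as the
# share of the verdict the rubric does NOT explain. Assumed construction, not
# the paper's exact method.
import numpy as np

def unexplained_variance(criterion_scores: np.ndarray, overall: np.ndarray) -> float:
    """criterion_scores: (n_verdicts, n_criteria); overall: (n_verdicts,)."""
    X = np.column_stack([np.ones(len(overall)), criterion_scores])  # add intercept
    beta, *_ = np.linalg.lstsq(X, overall, rcond=None)              # fit rubric -> verdict
    residuals = overall - X @ beta
    r_squared = 1.0 - residuals.var() / overall.var()
    return 1.0 - r_squared

# Toy data: 200 verdicts on 4 rubric criteria, where the overall verdict is
# driven mostly by noise rather than the rubric (mirroring the >90% finding).
rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 4))
overall = scores @ np.array([0.1, 0.1, 0.1, 0.1]) + rng.normal(size=200)
print(f"unexplained variance: {unexplained_variance(scores, overall):.2f}")
```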

📝 Abstract
LLM-judged benchmarks are increasingly used to evaluate complex model behaviors, yet their design introduces failure modes absent in conventional ground-truth-based benchmarks. We argue that without tight objectives and verifiable constructions, benchmarks can produce high-confidence rankings that are in fact largely noise. We introduce two mechanisms to diagnose these issues. Schematic adherence quantifies how much of a judge's overall verdict is explained by the explicit evaluation schema, revealing unexplained variance when judges deviate from their own rubric. Psychometric validity aggregates internal consistency and discriminant validity signals to quantify the irreducible uncertainty in any benchmarking run. Applying these tools to Arena-Hard Auto, we find severe schema incoherence and factor collapse across popular judges: for example, unexplained variance exceeding 90 percent for DeepSeek-R1-32B and factor correlations above 0.93 for most criteria. We also show that the ELO-style aggregation used by Arena-Hard Auto collapses and masks genuine ranking uncertainty. Our results highlight design failures that undermine validity and offer actionable principles for building better-scoped, reliability-aware LLM-judged benchmarks. We release our code at https://anonymous.4open.science/r/judgment-to-noise-947D/README.md
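
The abstract's point about ELO-style aggregation can be illustrated with a small experiment. The sketch below is our construction, not the released code: it fits Elo-style ratings to noisy pairwise verdicts and bootstraps the battles, and the wide, overlapping rating intervals are exactly what a point-estimate leaderboard hides.

```python
# Minimal sketch (our assumption of the aggregation, not the paper's code) of
# why a single Elo-style point estimate can hide ranking uncertainty:
# bootstrap-resampling the pairwise verdicts yields rating intervals that
# overlap even when the leaderboard shows a strict order.
import numpy as np

def elo_ratings(battles, n_models, k=4.0, base=400.0, init=1000.0):
    """battles: list of (winner_idx, loser_idx) pairwise judge verdicts."""
    r = np.full(n_models, init)
    for w, l in battles:
        expected_w = 1.0 / (1.0 + 10 ** ((r[l] - r[w]) / base))
        r[w] += k * (1.0 - expected_w)
        r[l] -= k * (1.0 - expected_w)
    return r

rng = np.random.default_rng(0)
# Toy data: 3 models of nearly identical true strength, noisy judge verdicts.
true_p = np.array([[0.50, 0.52, 0.54],
                   [0.48, 0.50, 0.52],
                   [0.46, 0.48, 0.50]])
battles = []
for _ in range(1000):
    a, b = rng.choice(3, size=2, replace=False)
    battles.append((a, b) if rng.random() < true_p[a, b] else (b, a))

point = elo_ratings(battles, 3)
boot = np.array([
    elo_ratings([battles[i] for i in rng.integers(len(battles), size=len(battles))], 3)
    for _ in range(200)
])
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
for m in range(3):
    print(f"model {m}: Elo {point[m]:.0f}  95% CI [{lo[m]:.0f}, {hi[m]:.0f}]")
```

Reporting only `point` reproduces the collapse the paper describes; reporting the intervals restores the ranking uncertainty.
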
Problem

Research questions and friction points this paper is trying to address.

LLM-judged benchmarks introduce failure modes absent in ground-truth benchmarks
Benchmark rankings produce high-confidence results that are largely noise
Current designs undermine validity through schema incoherence and factor collapse
Innovation

Methods, ideas, or system contributions that make the work stand out.

Schematic adherence quantifies unexplained variance in judge verdicts
Psychometric validity measures internal consistency and discriminant validity (see the sketch after this list)
Tools reveal schema incoherence and factor collapse in benchmarks
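
A hedged sketch of the two psychometric signals named above: Cronbach's alpha for internal consistency and inter-dimension correlations for discriminant validity. High alpha combined with near-1.0 cross-dimension correlations reproduces the "factor collapse" pattern the paper reports; the function names and toy data are assumptions, not the paper's implementation.

```python
# Two standard psychometric checks, assumed as stand-ins for the paper's
# internal-consistency and discriminant-validity metrics.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: (n_responses, n_items) scores meant to measure one construct."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_vars / total_var)

def max_interdim_correlation(dims: np.ndarray) -> float:
    """dims: (n_responses, n_dimensions); returns the largest off-diagonal |r|.
    Values near 1.0 mean the 'distinct' criteria measure a single factor."""
    corr = np.corrcoef(dims, rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return np.abs(off_diag).max()

rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 1))                # one latent factor
dims = shared + 0.15 * rng.normal(size=(500, 4))  # 4 nominally distinct criteria
print(f"alpha: {cronbach_alpha(dims):.2f}")                      # high consistency
print(f"max inter-dimension r: {max_interdim_correlation(dims):.2f}")  # ~0.9+: collapse
```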