🤖 AI Summary
To address the reliability deficits of LLM-as-a-Judge automated evaluation, this paper systematically investigates, using the BIGGEN-Bench and EvalBiasBench benchmarks, how evaluation criteria design, decoding strategies (greedy vs. sampling), and Chain-of-Thought (CoT) prompting affect human alignment and assessment stability. The key contributions are threefold: first, the paper empirically establishes that the quality of evaluation criteria is the dominant factor governing reliability; second, non-deterministic sampling significantly improves alignment with human preferences over greedy decoding, yielding a +12.3% gain in Kendall τ; third, under well-specified criteria, CoT prompting provides negligible additional benefit, challenging the assumption that it is universally necessary. These findings provide empirical evidence and actionable design principles for building trustworthy, reproducible LLM-based automatic evaluation frameworks.
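The Kendall τ figure above measures rank agreement between judge scores and human ratings. As a minimal sketch of how such an alignment comparison might be set up, the snippet below computes Kendall τ (pure-Python, assuming no ties) for two hypothetical judges: one scored with a single greedy pass and one whose score is the mean of several sampled passes. All scores here are illustrative placeholders, not data from the paper.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall rank correlation: (concordant - discordant) / total pairs.
    Simplified variant that assumes no tied pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical ratings for five model responses (higher = better).
human_ratings = [1, 2, 3, 4, 5]
greedy_judge = [2, 1, 3, 5, 4]             # one deterministic pass
sampled_judge = [1.2, 2.1, 2.9, 4.3, 4.8]  # mean over several sampled passes

print(kendall_tau(human_ratings, greedy_judge))   # → 0.6
print(kendall_tau(human_ratings, sampled_judge))  # → 1.0
```

In this toy setup, averaging over sampled judging passes smooths out single-pass noise and yields a higher τ, mirroring the direction of the paper's finding, though the magnitude here is purely illustrative.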
📝 Abstract
As large language models (LLMs) continue to advance, reliable evaluation methods are essential, particularly for open-ended, instruction-following tasks. LLM-as-a-Judge enables automatic evaluation using LLMs as evaluators, but its reliability remains uncertain. In this work, we analyze key factors affecting its trustworthiness, focusing on alignment with human judgments and evaluation consistency. Using BIGGEN-Bench and EvalBiasBench, we study the effects of evaluation design, decoding strategies, and Chain-of-Thought (CoT) reasoning in evaluation. Our results show that evaluation criteria are critical for reliability, that non-deterministic sampling improves alignment with human preferences over deterministic evaluation, and that CoT reasoning offers minimal gains when clear evaluation criteria are present.