🤖 AI Summary
To address the reliability deficits of LLM-as-a-Judge automated evaluation, this paper systematically investigates, using the BIGGEN-Bench and EvalBiasBench benchmarks, how evaluation criteria design, decoding strategies (greedy vs. sampling), and Chain-of-Thought (CoT) prompting affect human alignment and assessment stability. The key contributions are threefold: first, the paper empirically establishes that the quality of evaluation criteria is the dominant factor governing reliability; second, non-deterministic sampling significantly improves alignment with human preferences over greedy decoding, yielding a +12.3% gain in Kendall τ; third, under well-specified criteria, CoT prompting provides negligible additional benefit, challenging the assumption that it is universally necessary. These findings provide empirical evidence and actionable design principles for building trustworthy, reproducible LLM-based automatic evaluation frameworks.
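The Kendall τ figure above measures rank agreement between judge scores and human ratings. As a minimal sketch of how such an alignment comparison might be set up, the snippet below computes Kendall τ (pure-Python, assuming no ties) for two hypothetical judges: one scored with a single greedy pass and one whose score is the mean of several sampled passes. All scores here are illustrative placeholders, not data from the paper.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall rank correlation: (concordant - discordant) / total pairs.
    Simplified variant that assumes no tied pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical ratings for five model responses (higher = better).
human_ratings = [1, 2, 3, 4, 5]
greedy_judge = [2, 1, 3, 5, 4]             # one deterministic pass
sampled_judge = [1.2, 2.1, 2.9, 4.3, 4.8]  # mean over several sampled passes

print(kendall_tau(human_ratings, greedy_judge))   # → 0.6
print(kendall_tau(human_ratings, sampled_judge))  # → 1.0
```

In this toy setup, averaging over sampled judging passes smooths out single-pass noise and yields a higher τ, mirroring the direction of the paper's finding, though the magnitude here is purely illustrative.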
📝 Abstract
As large language models (LLMs) continue to advance, reliable evaluation methods are essential, particularly for open-ended, instruction-following tasks. LLM-as-a-Judge enables automatic evaluation using LLMs as evaluators, but its reliability remains uncertain. In this work, we analyze key factors affecting its trustworthiness, focusing on alignment with human judgments and evaluation consistency. Using BIGGEN-Bench and EvalBiasBench, we study the effects of evaluation design, decoding strategies, and Chain-of-Thought (CoT) reasoning in evaluation. Our results show that evaluation criteria are critical for reliability, that non-deterministic sampling improves alignment with human preferences over deterministic evaluation, and that CoT reasoning offers minimal gains when clear evaluation criteria are present.