No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding

📅 2025-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes a systematic limitation of large language models (LLMs) as unsupervised evaluators of dialogue response correctness: their judgment accuracy is strongly contingent on their own problem-solving ability, producing a "consistency trap", a sharp decline in discrimination performance on questions the LLM itself answers incorrectly. To study this, the authors release a human-annotated dataset of correctness labels for 1,200 LLM responses, sourcing questions from existing datasets and from BFF-Bench, a novel, challenging benchmark created for this analysis, and conduct multi-model experiments (e.g., Qwen 2.5 7B, GPT-4o) with fine-grained consistency analysis. They propose calibrating LLM judges by anchoring them with high-quality, human-written reference answers. Empirical results show that pairing a weaker LLM judge with high-quality human references achieves higher answer consistency (by up to 12.3%) than pairing a stronger LLM with low-quality synthetic references. This supports the core insight that reference quality can outweigh judge strength in LLM-based evaluation.

📝 Abstract
LLM-as-a-Judge is a framework that uses an LLM (large language model) to evaluate the quality of natural language text - typically text that is also generated by an LLM. This framework holds great promise due to its relatively low cost, ease of use, and strong correlations with human stylistic preferences. However, LLM Judges have been shown to exhibit biases that can distort their judgments. We evaluate how well LLM Judges can grade whether a given response to a conversational question is correct, an ability crucial to soundly estimating the overall response quality. To do so, we create and publicly release a human-annotated dataset with labels of correctness for 1,200 LLM responses. We source questions from a combination of existing datasets and a novel, challenging benchmark (BFF-Bench) created for this analysis. We demonstrate a strong connection between an LLM's ability to correctly answer a question and to grade responses to that question. Although aggregate-level statistics might imply a judge has high agreement with human annotators, it will struggle on the subset of questions it could not answer. To address this issue, we recommend a simple solution: provide the judge with a correct, human-written reference answer. We perform an in-depth analysis of how reference quality can affect the performance of an LLM Judge. We show that providing a weaker judge (e.g. Qwen 2.5 7B) with higher quality references reaches better agreement with human annotators than a stronger judge (e.g. GPT-4o) with synthetic references.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM Judges' ability to grade conversational response correctness.
Identifying biases in LLM Judges without human-annotated references.
Improving LLM Judge performance using high-quality human reference answers.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses human-annotated dataset for LLM evaluation
Introduces BFF-Bench for challenging LLM assessment
Recommends human-written reference answers for accuracy
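The paper's central measurement, agreement between a judge's correctness labels and human annotations, can be sketched with raw agreement plus Cohen's kappa (a common chance-corrected agreement statistic; the specific metrics and example labels here are illustrative assumptions, not necessarily the paper's exact reporting):

```python
# Hypothetical sketch: measuring judge-human agreement on binary
# correctness labels. The label data is made up for illustration.

def agreement_and_kappa(judge: list[int], human: list[int]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    assert len(judge) == len(human) and judge
    n = len(judge)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    p_e = sum(
        (judge.count(lbl) / n) * (human.count(lbl) / n)
        for lbl in set(judge) | set(human)
    )
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return p_o, kappa

# 1 = judged correct, 0 = judged incorrect.
judge_labels = [1, 1, 0, 1, 0, 0, 1, 1]
human_labels = [1, 1, 0, 0, 0, 1, 1, 1]
p_o, kappa = agreement_and_kappa(judge_labels, human_labels)
print(round(p_o, 3), round(kappa, 3))  # → 0.75 0.467
```

Running the same comparison separately on questions the judge could and could not answer itself would surface the "consistency trap" the summary describes: aggregate agreement can look high while agreement on the judge's own failure cases is poor.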