🤖 AI Summary
This work exposes a systematic limitation of large language models (LLMs) as automatic evaluators of response correctness in conversational question answering: their judgment accuracy is strongly tied to their own problem-solving ability, creating a "consistency trap" in which discrimination performance drops sharply on the questions the judge model cannot answer itself. To study this, the authors release a human-annotated dataset of 1,200 LLM responses labeled for correctness, with questions drawn from existing datasets and from BFF-Bench, a new, challenging benchmark built for this analysis, and they run multi-model experiments (e.g., Qwen 2.5 7B, GPT-4o) with fine-grained consistency analysis. They propose calibrating LLM judges by providing high-quality, human-written reference answers as anchors. Empirically, a weaker judge paired with high-quality human references reaches better agreement with human annotators (by up to 12.3%) than a stronger judge paired with lower-quality synthetic references, supporting the core insight that reference quality can outweigh model strength in LLM-based evaluation.
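To make the "consistency trap" analysis concrete, here is a minimal sketch (not the authors' code) of how one might split judge-human agreement by whether the judge model can solve the question on its own. The `Record` fields and the toy data are hypothetical stand-ins for the paper's 1,200 annotated responses.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Record:
    question_id: str
    judge_solved: bool    # did the judge model answer the question correctly itself?
    judge_verdict: bool   # judge's grade of the candidate response (True = correct)
    human_label: bool     # human annotator's grade of the same response

def agreement(records: List[Record]) -> float:
    """Fraction of responses where the judge's verdict matches the human label."""
    if not records:
        return float("nan")
    return sum(r.judge_verdict == r.human_label for r in records) / len(records)

# Toy data standing in for the real annotations.
records = [
    Record("q1", True, True, True),
    Record("q2", True, False, False),
    Record("q3", True, True, False),
    Record("q4", False, True, False),
    Record("q5", False, False, True),
    Record("q6", False, True, True),
]

solved = [r for r in records if r.judge_solved]
unsolved = [r for r in records if not r.judge_solved]

print(f"overall agreement:            {agreement(records):.2f}")
print(f"agreement on solved subset:   {agreement(solved):.2f}")
print(f"agreement on unsolved subset: {agreement(unsolved):.2f}")
```

An aggregate score over all records can look healthy while the unsolved subset lags well behind, which is exactly the gap the paper highlights.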
📝 Abstract
LLM-as-a-Judge is a framework that uses a large language model (LLM) to evaluate the quality of natural language text - typically text that is also generated by an LLM. This framework holds great promise due to its relatively low cost, ease of use, and strong correlations with human stylistic preferences. However, LLM Judges have been shown to exhibit biases that can distort their judgments. We evaluate how well LLM Judges can grade whether a given response to a conversational question is correct, an ability crucial to soundly estimating overall response quality. To do so, we create and publicly release a human-annotated dataset with correctness labels for 1,200 LLM responses. We source questions from a combination of existing datasets and a novel, challenging benchmark (BFF-Bench) created for this analysis. We demonstrate a strong connection between an LLM's ability to correctly answer a question and to grade responses to that question. Although aggregate-level statistics might imply a judge has high agreement with human annotators, it will struggle on the subset of questions it could not answer itself. To address this issue, we recommend a simple solution: provide the judge with a correct, human-written reference answer. We perform an in-depth analysis of how reference quality affects the performance of an LLM Judge. We show that a weaker judge (e.g. Qwen 2.5 7B) provided with higher-quality references reaches better agreement with human annotators than a stronger judge (e.g. GPT-4o) with synthetic references.
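The recommended fix, giving the judge a correct, human-written reference answer, amounts to a small change in how the judging prompt is assembled. The sketch below is illustrative only (the prompt wording and function name are assumptions, not the paper's actual template): with a reference present, grading reduces to comparing the candidate response against the reference rather than relying on the judge's own ability to answer the question.

```python
from typing import Optional

def build_judge_prompt(question: str, response: str, reference: Optional[str] = None) -> str:
    """Assemble a correctness-grading prompt, optionally anchored on a reference answer."""
    parts = [
        "You are grading whether a response to a question is factually correct.",
        f"Question:\n{question}",
    ]
    if reference is not None:
        parts.append(f"Reference answer (written by a human expert):\n{reference}")
    parts.append(f"Candidate response:\n{response}")
    parts.append('Reply with a single word: "correct" or "incorrect".')
    return "\n\n".join(parts)

# Without a reference, the judge must solve the question itself to grade the response;
# with one, it only needs to check the response against the reference.
print(build_judge_prompt(
    question="In what year was the first transatlantic telegraph cable completed?",
    response="It was completed in 1858.",
    reference="The first transatlantic telegraph cable was completed in 1858.",
))
```

The resulting prompt string can then be sent to whichever judge model is being evaluated; the reference-augmented variant is what the paper reports as closing most of the gap for weaker judges.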