Judging Against the Reference: Uncovering Knowledge-Driven Failures in LLM-Judges on QA Evaluation

📅 2026-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
When large language models (LLMs) serve as automated judges for question-answering (QA) evaluation, their scores can be distorted when the provided reference answer contradicts their pretrained (parametric) knowledge. This work proposes a swapped-reference framework that replaces the entity in the gold reference with an incorrect one and pairs each original or swapped reference with a correspondingly aligned candidate answer, enabling systematic study of LLM judges under controlled reference-belief conflicts. Experiments show that the grading reliability of mainstream LLM judges degrades sharply in such settings, and that existing prompting strategies fail to adequately mitigate the issue. The study is the first to identify and quantify this failure mode, revealing a fundamental limitation of current reference-based automatic evaluation.
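
As a concrete illustration of the setup, here is a minimal sketch of how such an original/swapped item pair might be constructed via entity substitution. This is not the authors' code; all names (`QAItem`, `make_pair`) and the example template are hypothetical.

```python
# Illustrative sketch only -- not the authors' released code. All names
# (QAItem, make_pair) and the example template are hypothetical.
from dataclasses import dataclass

@dataclass
class QAItem:
    question: str
    reference: str   # reference answer shown to the judge
    candidate: str   # candidate answer the judge must score
    swapped: bool    # True if the reference entity was replaced

def make_pair(question: str, gold: str, distractor: str,
              template: str) -> tuple[QAItem, QAItem]:
    """Build an (original, swapped) item pair via entity substitution.

    `template` has an `{entity}` slot. The candidate is kept aligned with
    whichever reference it is paired with, so a judge that adheres to the
    provided reference should accept it in both settings.
    """
    original = QAItem(question, template.format(entity=gold),
                      template.format(entity=gold), swapped=False)
    swapped_item = QAItem(question, template.format(entity=distractor),
                          template.format(entity=distractor), swapped=True)
    return original, swapped_item

orig, swap = make_pair(
    question="What is the capital of Australia?",
    gold="Canberra",
    distractor="Sydney",  # plausible but wrong: conflicts with parametric knowledge
    template="The capital of Australia is {entity}.",
)
```

A reference-faithful judge should score the candidate as correct under both references; any score gap between the two settings therefore isolates reliance on parametric knowledge rather than on the provided reference.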

📝 Abstract
While large language models (LLMs) are increasingly used as automatic judges for question answering (QA) and other reference-conditioned evaluation tasks, little is known about their ability to adhere to a provided reference. We identify a critical failure mode of such reference-based LLM QA evaluation: when the provided reference conflicts with the judge model's parametric knowledge, the resulting scores become unreliable, substantially degrading evaluation fidelity. To study this phenomenon systematically, we introduce a controlled swapped-reference QA framework that induces reference-belief conflicts. Specifically, we replace the reference answer with an incorrect entity and construct diverse pairings of original and swapped references with correspondingly aligned candidate answers. Surprisingly, grading reliability drops sharply under swapped references across a broad set of judge models. We empirically show that this vulnerability is driven by judges' over-reliance on parametric knowledge, leading judges to disregard the given reference under conflict. Finally, we find that this failure persists under common prompt-based mitigation strategies, highlighting a fundamental limitation of LLM-as-a-judge evaluation and motivating reference-based protocols that enforce stronger adherence to the provided reference.
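
To make the evaluation protocol concrete, the sketch below shows what a reference-based judging prompt could look like. This is an assumed format, not the paper's exact prompt; the explicit adherence instruction is one example of the prompt-based mitigations the paper reports as insufficient.

```python
# Assumed prompt format for a reference-based judge -- the paper's exact
# protocol may differ. The adherence instruction is one example of the
# prompt-based mitigations the paper reports as insufficient.
JUDGE_PROMPT = """You are grading a question-answering system.
Judge the candidate answer ONLY against the reference answer below,
even if you believe the reference is factually wrong.

Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Reply with a single word: CORRECT or INCORRECT."""

def render_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Fill the template; the resulting string is sent to the judge model."""
    return JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    )
```

The failure mode appears when, under a swapped reference, the judge answers INCORRECT for a candidate that matches the reference verbatim: its parametric belief about the true answer overrides the adherence instruction.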
Problem

Research questions and friction points this paper is trying to address.

LLM-as-a-judge
reference-based evaluation
knowledge conflict
QA evaluation
evaluation fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

reference-based evaluation
LLM-as-a-judge
parametric knowledge conflict
swapped-reference framework
evaluation fidelity