🤖 AI Summary
LLM-as-a-judge exhibits significant positive bias: high true positive rate (96%) but extremely low true negative rate (<25%), compounded by class imbalance, leading to systematically inflated evaluation scores. To address this, we propose a dual-track calibration framework: (1) a “minority veto” mechanism that assigns higher discriminative weight to invalid outputs in ensemble voting; and (2) a regression-based bias modeling module that explicitly estimates and corrects the LLM’s inherent agreement bias using minimal human annotations. Evaluated on 366 high-school-level Python code feedback tasks, our method reduces the maximum absolute error to 1.2%, doubling the performance of the best ensemble baseline. This work achieves, for the first time, interpretable, quantifiable, and calibratable mitigation of LLM judgment bias—enabling reliable, bias-aware automated assessment.
📝 Abstract
New Large Language Models (LLMs) become available every few weeks, and modern application developers are confronted with the unenviable task of deciding whether to switch to a new model. While human evaluation remains the gold standard, it is costly and unscalable. The state-of-the-art approach is to use LLMs as evaluators (LLM-as-a-judge), but this suffers from a critical flaw: LLMs exhibit a strong positive bias. We provide empirical evidence showing that while LLMs can identify valid outputs with high accuracy (i.e., a True Positive Rate of 96%), they are remarkably poor at identifying invalid ones (i.e., a True Negative Rate below 25%). This systematic bias, coupled with class imbalance, often leads to inflated reliability scores.
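The inflation mechanism can be seen with a few lines of arithmetic. The TPR and TNR values come from the abstract; the 80% true validity rate is a hypothetical illustration, not a figure from the paper:

```python
TPR = 0.96   # judge correctly accepts valid outputs (from the abstract)
TNR = 0.25   # judge correctly rejects invalid outputs (upper bound from the abstract)

def judged_valid_rate(true_valid_rate: float) -> float:
    """Fraction of outputs an LLM judge would label 'valid': true positives
    plus false positives (invalid outputs the judge fails to reject)."""
    return true_valid_rate * TPR + (1 - true_valid_rate) * (1 - TNR)

# Hypothetical: if 80% of outputs are truly valid, the judge reports ~91.8%,
# inflating the reliability score by roughly 12 points.
inflated = judged_valid_rate(0.80)
```

Because false positives dominate when most outputs are valid, the more imbalanced the classes, the more the judge's reported score drifts above the truth.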
While ensemble-based methods such as majority voting can help, we show that they are not sufficient. We introduce an optimal minority-veto strategy that is resilient to missing data and mitigates this bias to a large extent. For scenarios requiring even higher precision, we propose a novel regression-based framework that directly models the validator bias using a small set of human-annotated ground-truth data. On a challenging code feedback task over 366 high-school Python programs, our regression approach reduces the maximum absolute error to just 1.2%, a 2x improvement over the best-performing ensemble of 14 state-of-the-art LLMs.
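The two calibration tracks can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `veto_threshold`, the toy vote lists, the use of ordinary least squares, and all calibration numbers are assumptions introduced here.

```python
from typing import Optional, Sequence

def majority_vote(votes: Sequence[bool]) -> bool:
    """Baseline ensemble: 'valid' iff a strict majority of judges vote valid."""
    valid = sum(votes)
    return valid > len(votes) - valid

def minority_veto(votes: Sequence[Optional[bool]], veto_threshold: int = 1) -> bool:
    """Sketch of a minority-veto rule: a small number of 'invalid' votes
    (here, even a single one) overrides the majority, giving the rare
    negative judgments higher discriminative weight. Missing votes (None)
    are skipped, so the rule tolerates incomplete ensembles."""
    cast = [v for v in votes if v is not None]
    invalid = sum(1 for v in cast if not v)
    return invalid < veto_threshold

def fit_bias_correction(judged: Sequence[float], human: Sequence[float]):
    """Ordinary least-squares line mapping judge-reported validity rates
    onto human-annotated ones; a generic stand-in for the paper's
    regression-based bias model."""
    n = len(judged)
    mx, my = sum(judged) / n, sum(human) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(judged, human))
    var = sum((x - mx) ** 2 for x in judged)
    slope = cov / var
    return slope, my - slope * mx  # corrected = slope * judged + intercept

# Hypothetical calibration data: judge-reported vs. human-annotated rates
slope, intercept = fit_bias_correction([0.90, 0.95, 0.85], [0.78, 0.86, 0.70])
corrected = slope * 0.92 + intercept  # de-biased estimate for a new task
```

The veto rule directly attacks the low true-negative rate (a single dissenting "invalid" is informative precisely because judges so rarely produce one), while the regression track trades a small amount of human annotation for an explicit, correctable model of the bias.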