🤖 AI Summary
LLM-as-a-judge exhibits significant positive bias: high true positive rate (96%) but extremely low true negative rate (<25%), compounded by class imbalance, leading to systematically inflated evaluation scores. To address this, we propose a dual-track calibration framework: (1) a “minority veto” mechanism that assigns higher discriminative weight to invalid outputs in ensemble voting; and (2) a regression-based bias modeling module that explicitly estimates and corrects the LLM’s inherent agreement bias using minimal human annotations. Evaluated on 366 high-school-level Python code feedback tasks, our method reduces the maximum absolute error to 1.2%, doubling the performance of the best ensemble baseline. This work achieves, for the first time, interpretable, quantifiable, and calibratable mitigation of LLM judgment bias—enabling reliable, bias-aware automated assessment.
📝 Abstract
New Large Language Models (LLMs) become available every few weeks, and modern application developers are confronted with the unenviable task of deciding whether to switch to a new model. While human evaluation remains the gold standard, it is costly and unscalable. The state-of-the-art approach is to use LLMs as evaluators (LLM-as-a-judge), but this suffers from a critical flaw: LLMs exhibit a strong positive bias. We provide empirical evidence showing that while LLMs can identify valid outputs with high accuracy (i.e., a True Positive Rate of 96%), they are remarkably poor at identifying invalid ones (i.e., a True Negative Rate below 25%). This systematic bias, coupled with class imbalance, often leads to inflated reliability scores.
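The inflation mechanism can be seen with a few lines of arithmetic. The TPR and TNR values come from the abstract; the 80% true validity rate is a hypothetical illustration, not a figure from the paper:

```python
TPR = 0.96   # judge correctly accepts valid outputs (from the abstract)
TNR = 0.25   # judge correctly rejects invalid outputs (upper bound from the abstract)

def judged_valid_rate(true_valid_rate: float) -> float:
    """Fraction of outputs an LLM judge would label 'valid': true positives
    plus false positives (invalid outputs the judge fails to reject)."""
    return true_valid_rate * TPR + (1 - true_valid_rate) * (1 - TNR)

# Hypothetical: if 80% of outputs are truly valid, the judge reports ~91.8%,
# inflating the reliability score by roughly 12 points.
inflated = judged_valid_rate(0.80)
```

Because false positives dominate when most outputs are valid, the more imbalanced the classes, the more the judge's reported score drifts above the truth.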
While ensemble-based methods such as majority voting can help, we show that they are not sufficient. We introduce an optimal minority-veto strategy that is resilient to missing data and mitigates this bias to a large extent. For scenarios requiring even higher precision, we propose a novel regression-based framework that directly models the validator bias using a small set of human-annotated ground-truth data. On a challenging code feedback task over 366 high-school Python programs, our regression approach reduces the maximum absolute error to just 1.2%, a 2x improvement over the best-performing ensemble of 14 state-of-the-art LLMs.
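The two calibration tracks can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `veto_threshold`, the toy vote lists, the use of ordinary least squares, and all calibration numbers are assumptions introduced here.

```python
from typing import Optional, Sequence

def majority_vote(votes: Sequence[bool]) -> bool:
    """Baseline ensemble: 'valid' iff a strict majority of judges vote valid."""
    valid = sum(votes)
    return valid > len(votes) - valid

def minority_veto(votes: Sequence[Optional[bool]], veto_threshold: int = 1) -> bool:
    """Sketch of a minority-veto rule: a small number of 'invalid' votes
    (here, even a single one) overrides the majority, giving the rare
    negative judgments higher discriminative weight. Missing votes (None)
    are skipped, so the rule tolerates incomplete ensembles."""
    cast = [v for v in votes if v is not None]
    invalid = sum(1 for v in cast if not v)
    return invalid < veto_threshold

def fit_bias_correction(judged: Sequence[float], human: Sequence[float]):
    """Ordinary least-squares line mapping judge-reported validity rates
    onto human-annotated ones; a generic stand-in for the paper's
    regression-based bias model."""
    n = len(judged)
    mx, my = sum(judged) / n, sum(human) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(judged, human))
    var = sum((x - mx) ** 2 for x in judged)
    slope = cov / var
    return slope, my - slope * mx  # corrected = slope * judged + intercept

# Hypothetical calibration data: judge-reported vs. human-annotated rates
slope, intercept = fit_bias_correction([0.90, 0.95, 0.85], [0.78, 0.86, 0.70])
corrected = slope * 0.92 + intercept  # de-biased estimate for a new task
```

The veto rule directly attacks the low true-negative rate (a single dissenting "invalid" is informative precisely because judges so rarely produce one), while the regression track trades a small amount of human annotation for an explicit, correctable model of the bias.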