🤖 AI Summary
Large language models (LLMs) used as reference-free direct evaluators suffer from a pervasive "score-range bias": their output scores depend strongly on the pre-specified numerical range, degrading alignment with human judgments, and models from the same family exhibit similar biases. This work is the first to systematically identify and characterize the phenomenon. To mitigate it, the authors apply contrastive decoding, a lightweight, plug-and-play calibration strategy that reduces implicit reliance on predefined score ranges by contrasting scoring responses generated under different scales for the same input. Across multiple benchmark datasets, the method markedly improves stability across score intervals, yielding up to an 11.3% relative improvement on average in Spearman correlation with human ratings over baselines. Crucially, it requires no additional training, fine-tuning, or human annotation, offering an interpretable, training-free, and deployment-ready way to make LLMs more reliable automatic evaluators.
📝 Abstract
Large Language Models (LLMs) are commonly used as evaluators in various applications, but the reliability of their outcomes remains a challenge. One such challenge is using LLMs-as-judges for direct assessment, i.e., assigning scores from a specified range without any references. We first show that this challenge stems from score range bias: LLM judge outputs are highly sensitive to pre-defined score ranges, which prevents searching for an optimal range. We also show that similar biases exist among models from the same family. We then mitigate this bias through contrastive decoding, achieving up to 11.3% relative improvement on average in Spearman correlation with human judgments across different score ranges.
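To make the idea concrete, here is a minimal sketch of contrastive decoding applied to score calibration. It assumes access to log-probabilities over the candidate score tokens from two prompts that specify different score ranges, and combines them with the standard contrastive rule (main log-probs minus a scaled contrast term). The function name, the toy distributions, and the `alpha` value are illustrative assumptions, not details from the paper.

```python
import math

def contrastive_score_probs(logp_main, logp_contrast, alpha=0.5):
    """Calibrate a judge's score distribution via contrastive decoding.

    logp_main:     log-probs over score tokens under the target score-range prompt
    logp_contrast: log-probs over the same tokens under a contrasting score-range prompt
    alpha:         contrast strength (illustrative choice, not from the paper)
    Returns a normalized probability distribution over the score tokens.
    """
    # Standard contrastive combination: demote tokens the contrast prompt
    # also favors, which is where range-induced bias shows up.
    combined = [m - alpha * c for m, c in zip(logp_main, logp_contrast)]
    # Numerically stable softmax over the combined scores.
    z = max(combined)
    exps = [math.exp(x - z) for x in combined]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: distributions over score tokens "1".."5".
# Both prompts pile mass on high scores (the hypothesized range bias);
# contrasting them shifts the calibrated argmax toward the middle.
main = [math.log(p) for p in [0.05, 0.10, 0.20, 0.40, 0.25]]
contrast = [math.log(p) for p in [0.05, 0.05, 0.10, 0.50, 0.30]]
calibrated = contrastive_score_probs(main, contrast, alpha=0.5)
```

In this toy run, the raw judge distribution peaks at score "4", while the calibrated one peaks at "3", illustrating how the contrast term can counteract mass that both prompts assign for range-related rather than quality-related reasons.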