🤖 AI Summary
Large language models (LLMs) used as reference-free direct evaluators suffer from a pervasive "score-range bias": their output scores depend strongly on the pre-specified numerical range, degrading alignment with human judgments, and models from the same family exhibit similar biases. This work is the first to systematically identify and characterize the phenomenon. To mitigate it, the authors apply contrastive decoding, a lightweight, plug-and-play calibration strategy that reduces implicit reliance on predefined score ranges by contrasting scoring responses generated under different scales for the same input. Across multiple benchmark datasets, the method markedly improves stability across score intervals, yielding up to an 11.3% relative improvement on average in Spearman correlation with human ratings over baselines. Crucially, it requires no additional training, fine-tuning, or human annotation, offering an interpretable, training-free, and deployment-ready way to make LLMs more reliable automatic evaluators.
📝 Abstract
Large Language Models (LLMs) are commonly used as evaluators in various applications, but the reliability of their outcomes remains a challenge. One such challenge is using LLMs-as-judges for direct assessment, i.e., assigning scores from a specified range without any references. We first show that this challenge stems from score range bias: LLM judge outputs are highly sensitive to pre-defined score ranges, which prevents searching for an optimal range. We also show that similar biases exist among models from the same family. We then mitigate this bias through contrastive decoding, achieving up to 11.3% relative improvement on average in Spearman correlation with human judgments across different score ranges.
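To make the idea concrete, here is a minimal sketch of contrastive decoding applied to score calibration. It assumes access to log-probabilities over the candidate score tokens from two prompts that specify different score ranges, and combines them with the standard contrastive rule (main log-probs minus a scaled contrast term). The function name, the toy distributions, and the `alpha` value are illustrative assumptions, not details from the paper.

```python
import math

def contrastive_score_probs(logp_main, logp_contrast, alpha=0.5):
    """Calibrate a judge's score distribution via contrastive decoding.

    logp_main:     log-probs over score tokens under the target score-range prompt
    logp_contrast: log-probs over the same tokens under a contrasting score-range prompt
    alpha:         contrast strength (illustrative choice, not from the paper)
    Returns a normalized probability distribution over the score tokens.
    """
    # Standard contrastive combination: demote tokens the contrast prompt
    # also favors, which is where range-induced bias shows up.
    combined = [m - alpha * c for m, c in zip(logp_main, logp_contrast)]
    # Numerically stable softmax over the combined scores.
    z = max(combined)
    exps = [math.exp(x - z) for x in combined]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: distributions over score tokens "1".."5".
# Both prompts pile mass on high scores (the hypothesized range bias);
# contrasting them shifts the calibrated argmax toward the middle.
main = [math.log(p) for p in [0.05, 0.10, 0.20, 0.40, 0.25]]
contrast = [math.log(p) for p in [0.05, 0.05, 0.10, 0.50, 0.30]]
calibrated = contrastive_score_probs(main, contrast, alpha=0.5)
```

In this toy run, the raw judge distribution peaks at score "4", while the calibrated one peaks at "3", illustrating how the contrast term can counteract mass that both prompts assign for range-related rather than quality-related reasons.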