Quantitative LLM Judges

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the misalignment between LLM-as-judge scores and human annotations, this paper proposes a regression-based post-hoc calibration framework that maps raw LLM-generated scores to the human rating scale, using only the judge's original textual feedback and its associated numerical score. The proposed "quantitative LLM judges" uniformly support both absolute and relative feedback types while remaining computationally efficient and robust in low-data regimes. Evaluated on four benchmark datasets, the calibration improves Kendall's τ correlation between two base LLM judges and human ratings by an average of 18.7%, at a significantly lower training cost than supervised fine-tuning. The core contribution is an efficient, general-purpose, low-overhead alignment technique that requires no additional human annotations and no architectural modifications to the judge model.

📝 Abstract
LLM-as-a-judge is a framework in which a large language model (LLM) automatically evaluates the output of another LLM. We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to human scores in a given domain using regression models. The models are trained to improve the score of the original judge by using the judge's textual evaluation and score. We present four quantitative judges for different types of absolute and relative feedback, which showcases the generality and versatility of our framework. Our framework is more computationally efficient than supervised fine-tuning and can be more statistically efficient when human feedback is limited, which is expected in most applications of our work. We validate these claims empirically on four datasets using two base judges. Our experiments show that quantitative judges can effectively improve the predictive power of existing judges through post-hoc modeling.
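The core idea, post-hoc regression from a judge's output to the human rating scale, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the actual quantitative judges also condition on the judge's textual evaluation (e.g., via embeddings), whereas this toy version fits a simple least-squares map from the numeric judge score alone, on hypothetical data.

```python
import numpy as np

def fit_calibrator(judge_scores, human_scores):
    """Fit a linear map a*s + b from judge scores to human scores by least squares."""
    X = np.column_stack([judge_scores, np.ones_like(judge_scores)])
    coef, *_ = np.linalg.lstsq(X, human_scores, rcond=None)
    return coef  # (a, b)

def calibrate(judge_scores, coef):
    """Map raw judge scores onto the human rating scale."""
    return coef[0] * np.asarray(judge_scores) + coef[1]

# Hypothetical data: the judge rates on a 1-10 scale, humans on 1-5.
judge = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
human = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

coef = fit_calibrator(judge, human)
print(calibrate(judge, coef))  # calibrated scores on the human scale
```

Because only a small regression head is trained on top of a frozen judge, this kind of calibration is far cheaper than supervised fine-tuning and can work with limited human feedback.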
Problem

Research questions and friction points this paper is trying to address.

Align LLM judge scores to human evaluation standards
Improve evaluation accuracy using regression models
Enhance computational efficiency with limited human feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns LLM judge scores to human scores
Uses regression models for score improvement
More computationally efficient than supervised fine-tuning