From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation

📅 2026-06-11

🤖 AI Summary

This work addresses the high cost of human annotation and the systematic biases of large language models (LLMs) used as judges, which can distort model rankings. The authors propose a low-cost evaluation framework with a two-tier calibration mechanism. At the local level, it estimates per-battle uncertainty from the judge's score differences and passes calibrated win probabilities, rather than hard labels, into a Bradley–Terry model to produce Elo ratings. At the global level, it applies split conformal prediction to the residual gap between LLM-derived and human-derived Elo ratings, producing prediction intervals with distribution-free marginal coverage guarantees. Averaged over 55 held-out models from LMArena, the method achieves a mean absolute error of 17.9 Elo points against human-derived ratings.
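The local tier can be illustrated with a minimal sketch of a Bradley–Terry fit on soft labels. This is not the authors' implementation: the function names, the gradient-ascent optimizer, the learning rate, and the 1000-point Elo base are all assumptions made for illustration, and the step that turns judge score differences into calibrated win probabilities is taken as given.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_soft_bradley_terry(battles, n_models, lr=0.5, steps=2000):
    """Fit Bradley-Terry strengths from soft labels by gradient ascent.

    battles: list of (i, j, p), where p is an (assumed already calibrated)
    probability that model i beats model j. Hypothetical helper, not the
    paper's code.
    """
    theta = [0.0] * n_models
    for _ in range(steps):
        grad = [0.0] * n_models
        for i, j, p in battles:
            # Gradient of p*log(s) + (1-p)*log(1-s), s = sigmoid(theta_i - theta_j)
            g = p - sigmoid(theta[i] - theta[j])
            grad[i] += g
            grad[j] -= g
        theta = [t + lr * g / len(battles) for t, g in zip(theta, grad)]
        # Strengths are identified only up to a shift, so center them at zero.
        mean = sum(theta) / n_models
        theta = [t - mean for t in theta]
    return theta


def to_elo(theta, base=1000.0):
    """Map log-strengths to the Elo scale (400 points per factor of 10 in odds)."""
    scale = 400.0 / math.log(10.0)
    return [base + scale * t for t in theta]


# Toy example with three models and made-up win probabilities.
battles = [(0, 1, 0.7), (1, 2, 0.6), (0, 2, 0.8)]
elo = to_elo(fit_soft_bradley_terry(battles, n_models=3))
```

With a single battle at p = 0.75, the fitted gap is 400 · log10(3), or about 191 Elo points. A hard label (p = 1) would instead push the gap toward infinity, which is one reason soft labels give better-behaved ratings.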

📝 Abstract

Evaluating new large language models typically requires costly human annotation campaigns at scale. LLM-as-a-judge offers a cheaper alternative, but judge scores carry systematic errors (such as position bias, self-preference, or intransitivity) that can strongly miscalibrate the resulting rankings. We quantify the resulting judge-human disagreement at two complementary levels. At the local level, we estimate per-battle uncertainty from the judge's own score differences and propagate calibrated win probabilities, rather than hard labels, into the Bradley-Terry procedure. This alone substantially improves Elo estimation accuracy, bringing LLM-derived ratings within 17.9 Elo MAE of human-derived ones when averaged over 55 held-out models on LMArena. At the global level, we apply split conformal prediction to the residual gap between LLM-derived and human-derived Elo ratings across held-out models, producing prediction intervals with distribution-free marginal coverage guarantees that account for irreducible LLM-human disagreement. Together, these two layers yield a low-cost evaluation tool that provides developers with calibrated Elo estimates and honest uncertainty bounds, without access to large-scale human annotations. To facilitate reproducibility, we release our code at https://github.com/kargibora/SoftElo.
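The global tier can be sketched with textbook split conformal prediction on absolute residuals. This is an illustrative assumption, not the released code: the function names, the choice of absolute residual as the nonconformity score, and the toy Elo values below are all hypothetical. Under exchangeability of calibration and test models, the resulting interval covers the human-derived Elo with probability at least 1 − alpha.

```python
import math


def conformal_radius(llm_elo, human_elo, alpha=0.1):
    """Split-conformal radius from calibration models with both ratings.

    Uses |human - llm| as the nonconformity score (an assumption) and the
    standard finite-sample quantile index ceil((n + 1) * (1 - alpha)).
    """
    scores = sorted(abs(h - l) for l, h in zip(llm_elo, human_elo))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only a trivial interval is valid.
        return math.inf
    return scores[k - 1]


def predict_interval(new_llm_elo, radius):
    """Symmetric prediction interval around a new model's LLM-derived Elo."""
    return (new_llm_elo - radius, new_llm_elo + radius)


# Made-up calibration set: LLM-derived vs. human-derived Elo for nine models.
cal_llm = [1000, 1050, 1100, 1150, 1200, 1250, 1300, 1350, 1400]
cal_human = [1010, 1045, 1120, 1135, 1230, 1245, 1325, 1310, 1435]

radius = conformal_radius(cal_llm, cal_human, alpha=0.2)
low, high = predict_interval(1275, radius)
```

The guarantee is marginal: it holds on average over calibration sets and new models, not conditionally for any single model, which matches the "marginal coverage" wording in the abstract.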

Problem

Research questions and friction points this paper is trying to address.

LLM evaluation

systematic bias

ranking calibration

Elo estimation

human disagreement

Innovation

Methods, ideas, or system contributions that make the work stand out.

Conformal Prediction

Elo Estimation

LLM-as-a-Judge