🤖 AI Summary
Existing LLM evaluation methods rely predominantly on single-round pairwise comparisons or independent scoring, limiting their ability to establish globally consistent and robust rankings.
Method: This paper introduces an iterative pairwise comparison framework inspired by sports knockout tournaments, the first such application in LLM evaluation. It employs multi-round dynamic adversarial comparisons to accumulate preference signals and calibrate judgments via feedback, while incorporating explicit consistency constraints to enhance ranking robustness. The approach integrates preference modeling, rank aggregation, and iterative optimization.
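To make the tournament idea concrete, here is a minimal sketch of knockout-style ranking via pairwise comparisons. The `judge` callable is a hypothetical stand-in for an LLM pairwise judgment (not the paper's actual prompts or models): each round pairs off the surviving candidates, winners advance, and candidates are ranked by the round in which they are eliminated.

```python
def knockout_rank(candidates, judge):
    """Rank candidates best-first using a single-elimination tournament.

    `judge(a, b)` returns whichever of the pair it prefers; here it is a
    placeholder for an LLM pairwise comparison call.
    """
    eliminated = []            # losers, collected from earliest round outward
    pool = list(candidates)
    while len(pool) > 1:
        winners = []
        # Pair off candidates; an odd one out receives a bye to the next round.
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            w = judge(a, b)
            winners.append(w)
            eliminated.append(b if w == a else a)
        if len(pool) % 2 == 1:
            winners.append(pool[-1])
        pool = winners
    eliminated.extend(pool)    # the overall winner goes last
    return list(reversed(eliminated))

# Toy judge that prefers the longer answer, standing in for an LLM call.
answers = ["ok", "a fuller answer", "best, most complete answer", "meh"]
print(knockout_rank(answers, lambda a, b: a if len(a) >= len(b) else b))
```

Note that candidates knocked out in the same round are ordered arbitrarily in this sketch; the paper's consistency constraints and rank aggregation refine exactly such ties across iterations.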
Results: Evaluated on university exam grading and machine translation assessment tasks, the method increases the average Pearson correlation between LLM scores and expert judgments by 0.07, significantly improving alignment with human preferences. It establishes a novel paradigm for deploying LLMs as reliable, trustworthy evaluators.
📄 Abstract
Large Language Models (LLMs) have been shown to be effective evaluators across various domains, such as machine translation or scientific writing. Current LLM-as-a-Judge approaches rely mostly on individual assessments or a single round of pairwise assessments, preventing the judge LLM from developing a global ranking perspective. To address this, we present Knockout Assessment, an LLM-as-a-Judge method using a knockout tournament system with iterative pairwise comparisons. Experiments across three LLMs on two datasets show that Knockout Assessment improves scoring accuracy, increasing Pearson correlation with expert evaluations by 0.07 on average for university-level exam scoring and machine translation evaluation, aligning LLM assessments more closely with human scoring.