Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons

📅 2025-06-04
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM evaluation methods rely predominantly on single-round pairwise comparisons or independent scoring, limiting their ability to establish globally consistent and robust rankings. Method: This paper introduces an iterative pairwise comparison framework inspired by sports knockout tournaments, the first such application in LLM evaluation. It employs multi-round dynamic adversarial comparisons to accumulate preference signals and calibrate judgments via feedback, while incorporating explicit consistency constraints to enhance ranking robustness. The approach integrates preference modeling, rank aggregation, and iterative optimization. Results: Evaluated on university exam grading and machine translation assessment tasks, the method increases the average Pearson correlation between LLM scores and expert judgments by 0.07, significantly improving alignment with human preferences. It establishes a novel paradigm for deploying LLMs as reliable, trustworthy evaluators.
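The tournament-style procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `judge` stands in for a pairwise LLM call, and the repeated-tournament loop (winner removed, remainder re-ranked) is one straightforward way to turn knockout winners into a full ranking.

```python
import random

def knockout_rank(items, judge):
    """Rank items via repeated knockout tournaments.

    `judge(a, b)` returns the preferred item. Each tournament's winner
    takes the next rank position and is removed before the next round.
    """
    remaining = list(items)
    ranking = []
    while remaining:
        bracket = remaining[:]
        random.shuffle(bracket)  # randomize seeding each tournament
        while len(bracket) > 1:
            nxt = []
            # pair off contestants; an odd one out gets a bye
            for i in range(0, len(bracket) - 1, 2):
                nxt.append(judge(bracket[i], bracket[i + 1]))
            if len(bracket) % 2 == 1:
                nxt.append(bracket[-1])
            bracket = nxt
        winner = bracket[0]
        ranking.append(winner)
        remaining.remove(winner)
    return ranking

# Toy deterministic judge: prefers the longer answer
# (a stand-in for an actual LLM preference call).
answers = ["a", "bbb", "cc", "dddd"]
print(knockout_rank(answers, lambda x, y: x if len(x) >= len(y) else y))
# → ['dddd', 'bbb', 'cc', 'a']
```

With a consistent judge the shuffled seeding is irrelevant, but with a noisy LLM judge the iterated tournaments accumulate preference signal across rounds rather than trusting any single comparison.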

๐Ÿ“ Abstract
Large Language Models (LLMs) have been shown to be effective evaluators across various domains such as machine translation or the scientific domain. Current LLM-as-a-Judge approaches rely mostly on individual assessments or a single round of pairwise assessments, preventing the judge LLM from developing a global ranking perspective. To address this, we present Knockout Assessment, an LLM-as-a-Judge method using a knockout tournament system with iterative pairwise comparisons. Experiments across three LLMs on two datasets show that Knockout Assessment improves scoring accuracy, increasing Pearson correlation with expert evaluations by 0.07 on average for university-level exam scoring and machine translation evaluations, aligning LLM assessments more closely with human scoring.
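The reported metric is the Pearson correlation between LLM scores and expert scores. A minimal, self-contained computation is sketched below; the scores are made up for illustration and are not from the paper's data.

```python
import statistics

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length score lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical expert vs. LLM scores (illustrative only).
expert = [4.0, 2.0, 5.0, 3.0]
llm = [3.5, 2.5, 4.5, 3.0]
print(round(pearson(expert, llm), 3))  # → 0.983
```

A gain of 0.07 on this scale means the judge's scores track the experts' relative ordering and spread noticeably more closely.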
Problem

Research questions and friction points this paper is trying to address.

Improving LLM evaluation accuracy through iterative pairwise comparisons
Addressing limitations of single-round LLM-as-a-Judge assessments
Aligning LLM scoring more closely with human expert evaluations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses knockout tournament system for evaluations
Implements iterative pairwise comparisons method
Improves scoring accuracy with human alignment