PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations

📅 2023-07-06
🏛️ Trans. Mach. Learn. Res.
📈 Citations: 78 (influential: 6)
🤖 AI Summary
To address self-enhancement and positional bias in large language model (LLM) evaluation, this paper proposes a reference-free, multi-model collaborative evaluation paradigm. It introduces (1) Peer Rank (PR), which aggregates anonymous pairwise preference comparisons from multiple peer LLM reviewers into a final ranking, inducing a relatively accurate self-ranking of models under an anonymous setting; and (2) Peer Discussion (PD), a structured prompting framework in which two LLMs discuss an answer pair and try to reach mutual agreement on their preference. The approach combines prompt engineering, pairwise comparison modeling, and preference aggregation. Evaluated on two benchmark datasets, PR and PD achieve higher assessment accuracy than prior single-evaluator methods and agree better with human judgments, supporting both the effectiveness and the generalizability of the proposed paradigm.
📝 Abstract
Nowadays, the quality of responses generated by different modern large language models (LLMs) is hard to evaluate and compare automatically. Recent studies suggest and predominantly use LLMs for reference-free evaluation of open-ended question answering. More specifically, they use the recognized "strongest" LLM as the evaluator, which conducts pairwise comparisons of candidate models' answers and provides a ranking score. However, this intuitive method has multiple problems, such as bringing in self-enhancement (favoring its own answers) and positional bias. We draw insights and lessons from the educational domain (Cho & MacArthur, 2011; Walsh, 2014) to improve LLM-based evaluations. Specifically, we propose (1) the peer rank (PR) algorithm that takes into account each peer LLM's pairwise preferences of all answer pairs, and outputs a final ranking of models; and (2) peer discussion (PD), where we prompt two LLMs to discuss and try to reach a mutual agreement on the preferences of two answers. We conduct experiments on two benchmark datasets. We find that our approaches achieve higher accuracy and align better with human judgments. Interestingly, PR can induce a relatively accurate self-ranking of models under the anonymous setting, where each model's name is unrevealed. Our work provides space to explore evaluating models that are hard to compare for humans.
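The Peer Rank idea described in the abstract (each peer LLM's pairwise preferences aggregated into a final ranking) can be sketched as a small fixed-point computation in which reviewers that score well as contestants get more voting weight. This is an illustrative simplification under assumed conventions, not the paper's exact algorithm: the battle tuple format, the tie handling, and the reviewer-weighting scheme here are assumptions.

```python
from collections import defaultdict

def peer_rank(battles, n_iters=10):
    """Aggregate pairwise judgments from peer reviewers into a ranking.

    battles: list of (reviewer, model_a, model_b, winner) tuples,
    where winner is model_a, model_b, or None for a tie.
    Each reviewer's vote is weighted by that reviewer's own contestant
    score, updated iteratively (a simplified take on the PR idea).
    """
    models = sorted({m for _, a, b, _ in battles for m in (a, b)})
    reviewers = sorted({r for r, _, _, _ in battles})
    weights = {r: 1.0 for r in reviewers}  # start from equal reviewer weights
    scores = {m: 0.0 for m in models}

    for _ in range(n_iters):
        points = defaultdict(float)   # weighted wins per contestant
        totals = defaultdict(float)   # weighted battles per contestant
        for r, a, b, w in battles:
            for m in (a, b):
                totals[m] += weights[r]
            if w is None:             # tie: half a point to each side
                points[a] += 0.5 * weights[r]
                points[b] += 0.5 * weights[r]
            else:
                points[w] += weights[r]
        scores = {m: points[m] / totals[m] if totals[m] else 0.0
                  for m in models}
        # Reviewers that are also contestants inherit their contestant score;
        # pure reviewers keep weight 1.0.
        weights = {r: max(scores.get(r, 1.0), 1e-6) for r in reviewers}

    ranking = sorted(models, key=scores.get, reverse=True)
    return ranking, scores
```

With three reviewers "A", "B", "C" judging battles between contestants "A" and "B", a majority preference for "A" yields the ranking `["A", "B"]`, and because "B" loses as a contestant, its weight as a reviewer shrinks over iterations.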
Problem

Research questions and friction points this paper is trying to address.

Automatic Evaluation
Large Language Models
Response Quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Peer Rank
Peer Discussion
Large Language Models Evaluation
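The Peer Discussion contribution listed above can be sketched as a turn-taking loop in which two reviewer models exchange arguments until their verdicts match. The `ask` callable, the transcript layout, and the "Choice: A/B" verdict convention are all hypothetical stand-ins for an actual LLM API and the paper's real prompts.

```python
def peer_discussion(ask, question, answer_a, answer_b, reviewers, max_turns=4):
    """Two reviewer LLMs discuss which answer is better until they agree.

    ask(reviewer, transcript) -> reply string is a stand-in for an LLM call;
    replies are expected to end with a verdict line "Choice: A" or "Choice: B".
    A simplified sketch of the PD protocol, not the paper's exact prompts.
    """
    def verdict(reply):
        # Take the letter after the last "Choice:" marker in the reply.
        return reply.rsplit("Choice:", 1)[-1].strip()[:1].upper()

    transcript = (
        f"Question: {question}\n"
        f"Answer A: {answer_a}\nAnswer B: {answer_b}\n"
        "Discuss which answer is better. End with 'Choice: A' or 'Choice: B'.\n"
    )
    choices = {}
    for turn in range(max_turns):
        reviewer = reviewers[turn % 2]  # alternate between the two reviewers
        reply = ask(reviewer, transcript)
        transcript += f"{reviewer}: {reply}\n"  # both see the full discussion
        choices[reviewer] = verdict(reply)
        if len(choices) == 2 and len(set(choices.values())) == 1:
            return choices[reviewer]  # mutual agreement reached
    return None  # no consensus within the turn budget
```

If the reviewers never converge, the sketch returns `None` rather than forcing a verdict; how disagreement is resolved (e.g. falling back to a third judge) is left open here.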