🤖 AI Summary
Evaluating LLM-generated judgments for subjective tasks—e.g., academic peer review—remains challenging due to the absence of gold-standard references. Method: This paper introduces GEM, an unsupervised generative evaluation metric grounded in mutual information estimation, and constructs GRE-bench, a contamination-resistant, dynamically updated benchmark. Contribution/Results: GEM is the first metric enabling robust, gold-standard-free evaluation of subjective judgment generation. The authors propose a manipulation-resistant dynamic benchmark design and an adversarial analysis framework for LLM-as-judge settings. Experiments show that GEM achieves correlation with human annotations comparable to that of the GPT-4o Examiner, significantly outperforming all other baselines; exhibits superior robustness against strategic manipulations (e.g., paraphrasing, redundancy); and enables systematic assessment of peer-review capabilities across mainstream LLMs on the ICLR 2023 dataset.
📝 Abstract
We introduce GEM (the Generative Estimator for Mutual Information), an evaluation metric for assessing language generation by Large Language Models (LLMs), particularly in generating informative judgments, without the need for a gold-standard reference. GEM broadens the scenarios in which we can benchmark LLM generation performance: from traditional tasks like machine translation and summarization, where gold-standard references are readily available, to subjective tasks without clear gold standards, such as academic peer review. GEM uses a generative model to estimate the mutual information between candidate and reference responses, without requiring the reference to be a gold standard. In experiments on a human-annotated dataset, GEM demonstrates correlations with human scores competitive with the state-of-the-art GPT-4o Examiner, and it outperforms all other baselines. Additionally, GEM is more robust against strategic manipulations, such as rephrasing or elongation, which can artificially inflate scores under a GPT-4o Examiner. We also present GRE-bench (Generating Review Evaluation Benchmark), which evaluates LLMs on how well they generate high-quality peer reviews for academic research papers. Because GRE-bench is built on GEM, it inherits GEM's robustness properties. GRE-bench also circumvents data contamination (data leakage) by drawing on the continuous influx of new open-access research papers and peer reviews each year. We report GRE-bench results for various popular LLMs, assessing their peer-review capabilities on the ICLR 2023 dataset.
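The core idea of a generative mutual-information estimate can be illustrated with pointwise mutual information, PMI(x, y) = log p(y|x) − log p(y): a candidate review y scores highly when the reference review x makes it much more predictable than it is a priori. The sketch below is a minimal illustration of this scoring rule, not the paper's implementation; `toy_log_prob` is a hypothetical overlap-based stand-in for the log-likelihoods that, in GEM, would come from a generative language model scoring the candidate conditioned on (or independently of) the reference.

```python
from typing import Optional


def pmi_score(log_p_y_given_x: float, log_p_y: float) -> float:
    """Pointwise mutual information: log p(y|x) - log p(y).

    Higher means the reference x makes the candidate y more
    predictable, i.e. the two responses share more information.
    """
    return log_p_y_given_x - log_p_y


def toy_log_prob(text: str, condition: Optional[str] = None) -> float:
    """Hypothetical stand-in for a generative model's log-likelihood.

    NOT the paper's model: it penalizes length as a crude prior and
    rewards word overlap with the conditioning text as a crude
    surrogate for conditional likelihood.
    """
    words = set(text.split())
    if condition is None:
        return -float(len(words))  # longer text -> lower prior log-prob
    overlap = len(words & set(condition.split()))
    return -float(len(words)) + 0.5 * overlap  # shared content raises likelihood


reference = "the method lacks ablation on dataset size"
candidate = "no ablation study on dataset size is provided"
unrelated = "the figures are colorful and nicely arranged"

# A candidate that shares substance with the reference gets a higher
# PMI estimate than an unrelated one, even though neither matches a
# gold standard verbatim.
score_good = pmi_score(toy_log_prob(candidate, reference), toy_log_prob(candidate))
score_bad = pmi_score(toy_log_prob(unrelated, reference), toy_log_prob(unrelated))
```

This also hints at why the scheme resists rephrasing and elongation: padding the candidate lowers both its conditional and unconditional log-likelihoods, so the difference, the information actually shared with the reference, changes little.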