🤖 AI Summary
Existing LLM-as-a-judge evaluation paradigms lack a unified definition, systematic taxonomy, and dedicated benchmark, hindering fine-grained, reproducible assessment of judgment capabilities. Method: We formally define the LLM-as-a-judge paradigm and propose the first three-dimensional taxonomy: *what to judge* (task types), *how to judge* (methodological mechanisms), and *where to judge* (deployment scenarios). We further introduce the first open-source benchmark explicitly designed for evaluating judgment capability, covering 30+ diverse tasks and 15+ mainstream LLMs, and integrating prompt engineering, multi-granularity scoring modeling, and task-adaptive evaluation protocols. Contribution/Results: This work establishes discriminative AI evaluation as an independent research direction. Our benchmark, taxonomy, and evaluation framework have been widely adopted and extended by the research community, enabling standardized, transparent, and scalable assessment of LLM-based judges.
📝 Abstract
Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP). However, traditional methods, whether matching-based or embedding-based, often fall short of judging subtle attributes and delivering satisfactory results. Recent advancements in Large Language Models (LLMs) inspire the "LLM-as-a-judge" paradigm, where LLMs are leveraged to perform scoring, ranking, or selection across various tasks and applications. This paper provides a comprehensive survey of LLM-based judgment and assessment, offering an in-depth overview to advance this emerging field. We begin by giving detailed definitions from both input and output perspectives. Then we introduce a comprehensive taxonomy to explore LLM-as-a-judge from three dimensions: what to judge, how to judge, and where to judge. Finally, we compile benchmarks for evaluating LLM-as-a-judge and highlight key challenges and promising directions, aiming to provide valuable insights and inspire future research in this promising area. A paper list and more resources about LLM-as-a-judge can be found at https://github.com/llm-as-a-judge/Awesome-LLM-as-a-judge and https://llm-as-a-judge.github.io.
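To make the paradigm concrete, the following minimal sketch shows one common instantiation described in the abstract: pairwise *selection*, where a judge LLM is prompted to compare two candidate responses and its verdict is parsed from free-form output. The prompt template, the `Verdict:` output convention, and the function names are illustrative assumptions, not an interface defined by the survey; the actual call to a judge model is left out and only the prompt construction and verdict parsing are shown.

```python
import re

# Illustrative judge prompt for pairwise selection (assumed format, not from the survey).
JUDGE_TEMPLATE = """You are an impartial judge. Compare the two responses
to the question below and decide which is better.

Question: {question}

Response A: {answer_a}

Response B: {answer_b}

Reply with exactly one line: "Verdict: A", "Verdict: B", or "Verdict: Tie"."""


def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Fill the judge template with the question and the two candidate answers."""
    return JUDGE_TEMPLATE.format(question=question, answer_a=answer_a, answer_b=answer_b)


def parse_verdict(judge_output: str):
    """Extract the judge's selection ('A', 'B', or 'Tie') from its raw text output."""
    m = re.search(r"Verdict:\s*(A|B|Tie)", judge_output, re.IGNORECASE)
    return m.group(1).capitalize() if m else None


# In practice, the prompt would be sent to a judge LLM; here we parse a canned reply.
prompt = build_judge_prompt("What is 2 + 2?", "4", "5")
print(parse_verdict("Verdict: A"))  # A
print(parse_verdict("no clear answer"))  # None
```

Scoring and ranking variants follow the same pattern, swapping the verdict line for a numeric score or an ordered list and adjusting the parser accordingly.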