Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models (LLMs) are not yet reliable judges of long-form generation, and existing meta-evaluation benchmarks focus mainly on short-form outputs. To address this gap, this work introduces LongJudgeBench, a comprehensive benchmark for assessing LLM-based judges on long-form outputs. It covers diverse real-world scenarios and multiple judging protocols, including settings that supply rubrics or references. Experiments across multiple base models and judging settings show that current LLM judges remain unstable across scenarios, and that auxiliary information such as rubrics or references helps but is not always sufficient. The benchmark is intended to support research on more robust, context-aware, and human-aligned LLM-as-a-judge methods.

📝 Abstract

As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scalable alternative to human evaluation, yet its reliability in long-form output evaluation remains underexamined: existing meta-evaluation benchmarks focus mainly on short-form outputs. Compared with short-form evaluation, long-form evaluation is not merely a matter of output length; it often requires judges to handle more complex document-level demands. In this work, we introduce LongJudgeBench, a comprehensive benchmark for evaluating LLM judges on long-form outputs across diverse real-world scenarios and judging protocols. We systematically evaluate a broad range of LLM judges, covering multiple base models and judging settings. Our results reveal a substantial reliability gap: current LLM judges remain unstable across scenarios, and rubrics or references are helpful but not always sufficient. We hope LongJudgeBench will support future research on more robust, context-aware, and human-aligned LLM-as-a-judge methods. Our code is available at https://anonymous.4open.science/r/LongJudgeBench-F782.
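As a rough illustration of what "unstable across scenarios" can mean in a meta-evaluation, the sketch below computes how often a judge's verdict matches a human label within each scenario, then summarizes the spread across scenarios. This is a minimal sketch under stated assumptions, not the paper's metric or code: the record format, the function names `agreement_by_scenario` and `stability_summary`, the scenario names, and the toy data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def agreement_by_scenario(records):
    """Return the judge/human agreement rate for each scenario.

    `records` is an iterable of (scenario, judge_label, human_label)
    tuples. This layout is an assumption made for illustration only.
    """
    hits = defaultdict(list)
    for scenario, judge_label, human_label in records:
        hits[scenario].append(1.0 if judge_label == human_label else 0.0)
    return {scenario: mean(values) for scenario, values in hits.items()}


def stability_summary(per_scenario):
    """Summarize how much agreement varies from scenario to scenario."""
    rates = list(per_scenario.values())
    return {
        "mean": mean(rates),       # average agreement over scenarios
        "spread": pstdev(rates),   # population std. dev. across scenarios
        "worst": min(rates),       # weakest scenario
    }


# Hypothetical toy data: pairwise verdicts ("A" or "B") in two scenarios.
records = [
    ("summarization", "A", "A"),
    ("summarization", "B", "B"),
    ("report_writing", "A", "B"),
    ("report_writing", "B", "B"),
]

per_scenario = agreement_by_scenario(records)
summary = stability_summary(per_scenario)
print(per_scenario)
print(summary)
```

A judge with a high mean but a large spread or a low worst-case rate would look reliable on average while still failing in particular scenarios, which is the kind of gap the abstract describes.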

Problem

Research questions and friction points this paper is trying to address.

LLM-as-a-Judge

long-form evaluation

benchmark

reliability

text generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-as-a-Judge

long-form evaluation

benchmark

reliability

meta-evaluation

🔎 Similar Papers

LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

2024-06-26 | arXiv.org | Citations: 69

Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates

2024-08-23 | arXiv.org | Citations: 15

Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA

2024-08-17 | Proceedings of the 9th Widening NLP Workshop | Citations: 1