🤖 AI Summary
Current large language models (LLMs) are not yet reliable evaluators of long-form generation, and the field lacks a systematic benchmark for judging complex document-level outputs. To address this gap, this work proposes LongJudgeBench, a comprehensive benchmark for assessing LLM-based judges on long-form outputs. It covers diverse real-world scenarios and multiple judging protocols, including settings with scoring rubrics and reference texts. Through experiments across a broad range of LLM judges, covering multiple base models and judging settings, the study finds that current judges remain unstable across scenarios and that auxiliary information such as rubrics or references helps but is not always sufficient. The benchmark is intended to support future research on more robust, context-aware, and human-aligned LLM-as-a-judge methods.
📝 Abstract
As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scalable alternative to human evaluation, yet its reliability in long-form output evaluation remains underexamined: existing meta-evaluation benchmarks focus mainly on short-form outputs. Compared with short-form evaluation, long-form evaluation is not merely a matter of output length; it often requires judges to handle more complex document-level demands. In this work, we introduce LongJudgeBench, a comprehensive benchmark for evaluating LLM judges on long-form outputs across diverse real-world scenarios and judging protocols. We systematically evaluate a broad range of LLM judges, covering multiple base models and judging settings. Our results reveal a substantial reliability gap: current LLM judges remain unstable across scenarios, and rubrics or references are helpful but not always sufficient. We hope LongJudgeBench will support future research on more robust, context-aware, and human-aligned LLM-as-a-judge methods. Our code is available at https://anonymous.4open.science/r/LongJudgeBench-F782.