Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation

📅 2026-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of transparent, well-documented protocols in human evaluation of long-form text generation, which hinders interpretation and cross-study comparison. The authors define a set of 20 reportable criteria related to the reproducibility of human evaluation studies and use them to review reporting practices at scale. The analysis combines a full manual review of 284 papers from *CL conferences (2023–2025) with large language model–assisted analysis of more than 1,800 additional papers. It finds that important aspects of study design are widely under-reported, leaving it unclear what was measured and how, who contributed judgments, and how those judgments should be interpreted. The authors outline actionable recommendations for more transparent and reproducible reporting, and publicly release their analysis code and annotated dataset.

📝 Abstract

Human evaluation plays a critical role in assessing the quality of generated text. However, the reliability and reproducibility of these evaluations depend on transparent and well-documented protocols, details that are frequently missing in current practice. In this work, we conduct a large-scale analysis of human evaluation protocols for long-form generation tasks in *CL conference publications from 2023–2025, including a full manual review of 284 papers and LLM-assisted analysis of another 1.8k+ papers. We define a set of 20 reportable criteria related to the reproducibility of human evaluation studies, and apply these criteria to systematically examine reporting norms and practices within the community. We find widespread under-reporting of important aspects of human evaluation study design, leading to ambiguity about what was measured and how, who contributed judgments, and how judgments should be interpreted. Based on these findings, we outline actionable recommendations to support more transparent and reproducible reporting in future research. Our analysis code and annotated dataset can be found at: https://github.com/larchlab/Illusions-of-the-Gold-Standard
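
As a rough illustration of what applying reportable criteria across a set of papers can look like, the sketch below computes, for each criterion, the fraction of annotated papers that report it. This is not the authors' analysis code: the criterion names, the `reporting_rates` helper, and the toy data are all hypothetical, and the 20 criteria defined in the paper are not reproduced here.

```python
from collections import Counter

# Hypothetical criterion names, for illustration only; the paper defines
# 20 criteria, which are not listed in this summary.
CRITERIA = ["annotator_count", "annotator_background", "rating_scale"]


def reporting_rates(annotations, criteria=CRITERIA):
    """Return the fraction of papers that report each criterion.

    `annotations` maps a paper ID to the set of criteria that paper reports.
    """
    if not annotations:
        return {c: 0.0 for c in criteria}
    counts = Counter()
    for reported in annotations.values():
        # Count each known criterion at most once per paper.
        counts.update(c for c in criteria if c in reported)
    total = len(annotations)
    return {c: counts[c] / total for c in criteria}


# Toy annotations, invented for this example (not data from the paper).
papers = {
    "paper_a": {"annotator_count", "rating_scale"},
    "paper_b": {"annotator_count"},
    "paper_c": set(),
    "paper_d": {"annotator_count", "annotator_background"},
}
rates = reporting_rates(papers)
for criterion, rate in rates.items():
    print(f"{criterion}: {rate:.0%} of papers report it")
```

A low rate for a criterion in a tally like this is what "under-reporting" means in practice: most papers leave that aspect of the study design unstated.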

Problem

Research questions and friction points this paper is trying to address.

human evaluation

reproducibility

long-form text generation

evaluation protocols

reporting standards

Innovation

Methods, ideas, or system contributions that make the work stand out.

human evaluation

reproducibility

long-form text generation

evaluation protocols

transparent reporting

🔎 Similar Papers

LLM-based NLG Evaluation: Current Status and Challenges

2024-02-02 | Computational Linguistics | Citations: 29