Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

📅 2026-06-01
🤖 AI Summary
This study addresses incomplete reporting of human annotation practices in natural language processing (NLP) research, which undermines reproducibility and quality assessment. The authors introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated-gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks. The best model reaches agreement with the adjudicated labels comparable to human-human agreement (Krippendorff's α = 0.606 versus 0.585). Applying the pipeline to ACL-venue papers from 2018 to 2025 yields Annotated-llm, with 2,667 annotation tasks extracted from 1,603 papers. Reporting has improved over time but remains uneven: operational details such as recruitment strategies, annotator expertise, and annotation volume are often reported, while training, language proficiency, compensation, socio-demographics, adjudication, and agreement values are frequently omitted, especially in model-evaluation studies. The paper also proposes bare-minimum reporting recommendations.
📝 Abstract
Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, but papers often leave unclear who produced the annotations and how the annotation process was controlled. We provide the first large-scale, task-level audit of human annotation reporting across major NLP venues, asking which annotation details are documented, which are missing, and how reporting varies across time, topic, venue, and intended use of human judgment. We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated-gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks, where the best model reaches human-comparable agreement with adjudicated labels, with Krippendorff's alpha of 0.606 versus 0.585 for human-human agreement. Using this pipeline, we construct Annotated-llm, a dataset covering ACL-venue papers from 2018-2025, with 2,667 extracted annotation tasks from 1,603 papers, and find that papers frequently report operational details such as recruitment strategies, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values, especially in model-evaluation studies. Our results show that annotation reporting in NLP has improved over time but remains uneven, and they establish a scalable framework and bare-minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable.
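The abstract reports agreement as Krippendorff's alpha but does not say which distance metric or implementation was used. As a rough illustration only, the sketch below computes alpha for nominal labels from a coincidence matrix; the function name `krippendorff_alpha_nominal` and the input layout (one list of labels per annotated item, `None` for a missing rating) are assumptions made for this example, not the authors' code.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels (illustrative sketch).

    units: list of lists; each inner list holds the labels that different
    annotators gave to one item. None marks a missing rating.
    """
    coincidences = Counter()
    for labels in units:
        labels = [label for label in labels if label is not None]
        m = len(labels)
        if m < 2:
            continue  # an item with fewer than two ratings cannot be paired
        # Every ordered pair of ratings from different annotators counts,
        # weighted by 1 / (m - 1) so each rating contributes a total of 1.
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals n_c and the overall number of pairable ratings n.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more ratings")

    # Observed and expected disagreement (nominal distance: 0 if equal, else 1).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # only one label was ever used; alpha is undefined
    return 1.0 - observed / expected


if __name__ == "__main__":
    ratings = [["a", "a"], ["a", "b"], ["b", "b"], ["b", "a"]]
    print(krippendorff_alpha_nominal(ratings))
```

A value of 1 means perfect agreement and values near 0 mean agreement at chance level, which is the scale on which the reported 0.606 (model versus adjudicated labels) and 0.585 (human versus human) should be read.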
Problem

Research questions and friction points this paper is trying to address.

human annotation
annotation reporting
reproducibility
annotation validity
NLP datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

human annotation auditing
LLM-assisted extraction
annotation reporting taxonomy
reproducibility in NLP
Annotated-llm dataset
Maria Kunilovskaya
PhD, Postdoc at University of Saarland
machine learning, computational linguistics, parallel corpora, translation quality estimation, learner translator corpora
Gagan Bhatia
University of Aberdeen
Natural Language Processing, Machine Learning, Deep Learning, LLM Alignment, Financial NLP
Lisa Sophie Albertelli
NLLG Lab University of Technology Nuremberg
Yanran Chen
PhD student, Technische Universität Nürnberg
NLP
Christian Greisinger
NLLG Lab University of Technology Nuremberg
Lotta Kiefer
NLLG Lab University of Technology Nuremberg
Christoph Leiter
PhD Student, University of Mannheim
Evaluation metrics for natural language generation
Subhadeep Roy
NLLG Lab University of Technology Nuremberg
Tewodros Achamaleh
Interdisciplinary Transformation University, Austria
Muhammad Arslan Manzoor
IT:U Austria
NLP, LLMs, Media Bias, Social Graphs
Sebastian Pohl
Interdisciplinary Transformation University, Austria
Yufang Hou
Interdisciplinary Transformation University, Austria
Steffen Eger
Full Professor, University of Technology Nuremberg (UTN)
Evaluation Metrics, Natural Language Processing, Deep Learning, Computational Social Science