WikiNER-fr-gold: A Gold-Standard NER Corpus

📅 2024-10-29
🏛️ arXiv.org
🤖 AI Summary
Problem: The French portion of the WikiNER corpus, a semi-supervised silver-standard resource that was never manually verified, suffers from low annotation quality and poor consistency. Method: We construct WikiNER-fr-gold, a human-verified gold-standard corpus for French named entity recognition (NER), built from a random 20% sample of the French sub-corpus and covering 26,818 sentences and about 700,000 tokens. It is accompanied by a structured annotation guideline and a systematic error analysis framework. The “silver-to-gold” upgrading procedure relies on iterative human verification, reconstruction of the entity type taxonomy, inter-annotator agreement assessment, and root-cause error attribution. Contribution/Results: We quantitatively identify critical biases in the original corpus, including nested entity omissions and type confusions. The resulting resource provides both a reproducible methodology for building multilingual NER gold standards and a high-quality benchmark dataset.

📝 Abstract
In this article we address the quality of the WikiNER corpus, a multilingual Named Entity Recognition (NER) corpus, and provide a consolidated version of it. The annotation of WikiNER was produced in a semi-supervised manner, i.e., no manual verification was carried out a posteriori. Such a corpus is called silver-standard. In this paper we propose WikiNER-fr-gold, a revised version of the French portion of WikiNER. Our corpus consists of a random sample of 20% of the original French sub-corpus (26,818 sentences with 700k tokens). We start by summarizing the entity types included in each category in order to define an annotation guideline, and then we proceed to revise the corpus. Finally, we present an analysis of the errors and inconsistencies observed in the WikiNER-fr corpus, and we discuss potential directions for future work.
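
The two mechanical steps in this workflow, drawing a random 20% sample of sentences and comparing silver labels against revised gold labels, can be illustrated with a minimal sketch. This is not the authors' code: the function names `sample_sentences` and `label_confusions`, the fixed seed, and the toy tag names are all assumptions made for illustration, and the actual sampling procedure and tag set used for WikiNER-fr-gold may differ.

```python
import random
from collections import Counter


def sample_sentences(sentences, fraction=0.2, seed=0):
    """Draw a random sample of `fraction` of the sentences, without replacement.

    Hypothetical helper: the seed and rounding rule are illustrative choices.
    """
    rng = random.Random(seed)  # local RNG so the sample is reproducible
    k = round(len(sentences) * fraction)
    return rng.sample(sentences, k)


def label_confusions(silver, gold):
    """Count (silver_label, gold_label) pairs where the two annotations differ.

    Both inputs are token-aligned label sequences of equal length.
    """
    if len(silver) != len(gold):
        raise ValueError("silver and gold label sequences must be aligned")
    return Counter((s, g) for s, g in zip(silver, gold) if s != g)


# Toy data only; these are not sentences or labels from WikiNER.
corpus = [f"sentence {i}" for i in range(10)]
subset = sample_sentences(corpus)

silver = ["PER", "O", "LOC", "ORG", "O"]
gold = ["PER", "O", "ORG", "ORG", "MISC"]
confusions = label_confusions(silver, gold)

for (silver_tag, gold_tag), count in confusions.most_common():
    print(f"silver={silver_tag} gold={gold_tag}: {count}")
```

Counting disagreements as (silver, gold) pairs keeps the direction of each correction visible, which is the kind of information a type-confusion analysis needs; a span-level comparison would be required to capture boundary errors.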
Problem

Research questions and friction points this paper is trying to address.

Improving quality of WikiNER French corpus
Creating gold-standard NER annotations manually
Analyzing errors in original silver-standard corpus
Innovation

Methods, ideas, or system contributions that make the work stand out.

Revised French WikiNER corpus manually
Defined annotation guidelines for consistency
Analyzed errors in original silver-standard corpus
Danrun Cao
Univ. Bretagne Sud, CNRS, IRISA
Nicolas Béchet
Docteur en Informatique, Université de Bretagne Sud, IRISA
Data Mining, NLP, Text Mining, Information Extraction
P. Marteau
Univ. Bretagne Sud, CNRS, IRISA