OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages

📅 2024-12-12
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multilingual NER research is hindered by the lack of standardized, openly accessible datasets with a consistent structure across languages and ontologies. OpenNER 1.0 addresses this with an open, standardized collection of 34 NER datasets covering 51 languages. The collection corrects annotation format issues, unifies the annotation format (CoNLL), maps entity type names so they are more consistent across corpora, and organizes the data to support multilingual and multi-ontology experiments. This targets three long-standing obstacles: format heterogeneity, entity taxonomies that are hard to compare, and restricted data access. Baselines built with mBERT, XLM-R, and InfoXLM are reported alongside ready-to-use data splits. The resource is intended to lower the barrier to multilingual NER experimentation and to improve comparability and reproducibility in multilingual and multi-ontology NER research.

📝 Abstract
We present OpenNER 1.0, a standardized collection of openly available named entity recognition (NER) datasets. OpenNER contains 34 datasets spanning 51 languages, annotated in varying named entity ontologies. We correct annotation format issues, standardize the original datasets into a uniform representation, map entity type names to be more consistent across corpora, and provide the collection in a structure that enables research in multilingual and multi-ontology NER. We provide baseline models using three pretrained multilingual language models to compare the performance of recent models and facilitate future research in NER.
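The type-name mapping step described above can be illustrated with a minimal sketch. It assumes CoNLL-style input with one token per line, the BIO tag in the last column, and blank lines between sentences. The `TYPE_MAP` entries and the function names are hypothetical and chosen only for illustration; they are not the mapping or code distributed with OpenNER.

```python
# Hypothetical mapping from corpus-specific type names to a shared convention.
# These pairs are illustrative assumptions, not the mapping used by OpenNER.
TYPE_MAP = {"PERSON": "PER", "LOCATION": "LOC", "ORGANIZATION": "ORG"}


def normalize_tag(tag: str) -> str:
    """Rename the entity type in a BIO tag, leaving the B-/I- prefix untouched."""
    if "-" not in tag:
        # Covers the outside tag "O" and any tag without a prefix.
        return tag
    prefix, _, entity_type = tag.partition("-")
    return f"{prefix}-{TYPE_MAP.get(entity_type, entity_type)}"


def normalize_conll(lines):
    """Rewrite the last column of each token line; keep blank lines as sentence breaks."""
    out = []
    for line in lines:
        if not line.strip():
            out.append("")
            continue
        *fields, tag = line.split()
        out.append(" ".join(fields + [normalize_tag(tag)]))
    return out


sample = [
    "Ada B-PERSON",
    "visited O",
    "Paris B-LOCATION",
    "",
    "Acme B-ORGANIZATION",
    "Corp I-ORGANIZATION",
]
normalized = normalize_conll(sample)
```

Types missing from the mapping pass through unchanged, so a corpus with a richer ontology keeps its extra labels rather than having them silently dropped.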
Problem

Research questions and friction points this paper is trying to address.

Standardizing multilingual NER datasets for consistency
Providing baseline results using diverse language models
Addressing performance gaps in LLMs for NER tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Standardized multilingual NER datasets
Uniform entity type representation
Baseline results with multiple models
Chester Palen-Michel
Michtom School of Computer Science, Brandeis University
Maxwell Pickering
Michtom School of Computer Science, Brandeis University
Maya Kruse
Michtom School of Computer Science, Brandeis University
Jonne Saleva
Michtom School of Computer Science, Brandeis University
Constantine Lignos
Brandeis University
Computational linguistics, Natural language processing, Language acquisition, Language processing, Human-robot interaction