ESNERA: Empirical and semantic named entity alignment for named entity dataset merging

📅 2025-08-09
🤖 AI Summary
To address inconsistent annotation schemas across multi-source named entity recognition (NER) datasets and the high cost of manual label alignment, this paper proposes an automatic label alignment method that combines two signals: empirical similarity, modeled from entity co-occurrence statistics, and semantic similarity, derived from label representations produced by pre-trained language models. Label spaces are then fused across datasets with a greedy pairwise merging strategy. The authors present the framework as more interpretable, scalable, and efficient than manual label mapping or label-graph construction. Experiments first merge three existing NER datasets with minimal impact on NER performance, then integrate the merged corpus with a small, self-built financial-domain dataset, where the aligned label space improves NER performance in the low-resource setting.
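
The summary does not give the exact scoring formulas, so the following is only a minimal sketch of how an empirical and a semantic signal could be combined into one label-similarity score. Everything here is an assumption made for illustration: the Jaccard overlap of entity surface forms stands in for the co-occurrence statistics, the toy vectors stand in for embeddings from a pre-trained language model, and the names `jaccard`, `cosine`, `combined_similarity`, and the weight `alpha` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard(entities_a, entities_b):
    """Overlap of the entity strings annotated under two labels (assumed empirical signal)."""
    a, b = set(entities_a), set(entities_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def combined_similarity(emb_a, emb_b, entities_a, entities_b, alpha=0.5):
    """Weighted mix of empirical and semantic similarity; alpha is a hypothetical weight."""
    empirical = jaccard(entities_a, entities_b)
    semantic = cosine(emb_a, emb_b)
    return alpha * empirical + (1.0 - alpha) * semantic


# Toy inputs: a "PER" label from one dataset and a "PERSON" label from another.
# The vectors are placeholders, not real language-model embeddings.
per_entities = {"Alice Chen", "Bob Lee", "Maria Silva"}
person_entities = {"Bob Lee", "Maria Silva", "John Smith"}
score = combined_similarity([0.9, 0.1, 0.0], [0.8, 0.2, 0.1], per_entities, person_entities)
print(f"PER vs PERSON similarity: {score:.3f}")
```

A single weight keeps the contribution of each signal easy to inspect, which is in the spirit of the interpretability goal stated above; how the paper actually fuses the two similarities may differ.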

📝 Abstract
Named Entity Recognition (NER) is a fundamental task in natural language processing. It remains a research hotspot due to its wide applicability across domains. Although recent advances in deep learning have significantly improved NER performance, they rely heavily on large, high-quality annotated datasets. However, building these datasets is expensive and time-consuming, posing a major bottleneck for further research. Current dataset merging approaches mainly focus on strategies like manual label mapping or constructing label graphs, which lack interpretability and scalability. To address this, we propose an automatic label alignment method based on label similarity. The method combines empirical and semantic similarities, using a greedy pairwise merging strategy to unify label spaces across different datasets. Experiments are conducted in two stages: first, merging three existing NER datasets into a unified corpus with minimal impact on NER performance; second, integrating this corpus with a small-scale, self-built dataset in the financial domain. The results show that our method enables effective dataset merging and enhances NER performance in the low-resource financial domain. This study presents an efficient, interpretable, and scalable solution for integrating multi-source NER corpora.

Problem

Research questions and friction points this paper is trying to address.

Aligning named entity labels across diverse datasets
Merging NER datasets with minimal performance impact
Enhancing NER in low-resource domains like finance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines empirical and semantic label similarities
Uses greedy pairwise merging strategy
Enables efficient multi-source NER corpus integration
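
The points above can be illustrated with a small sketch of greedy pairwise merging over precomputed label-similarity scores. This is one plausible reading, not the authors' implementation: the one-to-one matching constraint, the `threshold` cutoff, the function name `greedy_merge`, and the example scores are all assumptions introduced here.

```python
def greedy_merge(scores, threshold=0.5):
    """Greedily align labels of dataset B to labels of dataset A.

    scores maps (label_a, label_b) pairs to a similarity in [0, 1].
    Pairs are visited from most to least similar; a pair is merged only if
    neither label has been merged already (assumed one-to-one constraint).
    Returns a dict mapping each merged B label to its A label.
    """
    mapping = {}
    used_a, used_b = set(), set()
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (label_a, label_b), similarity in ranked:
        if similarity < threshold:
            break  # every remaining pair scores lower still
        if label_a in used_a or label_b in used_b:
            continue  # one side was already merged with a better match
        mapping[label_b] = label_a
        used_a.add(label_a)
        used_b.add(label_b)
    return mapping


# Hypothetical similarity scores between two small label sets.
example_scores = {
    ("PER", "PERSON"): 0.90,
    ("PER", "ORG"): 0.20,
    ("ORG", "ORG"): 0.95,
    ("LOC", "GPE"): 0.70,
    ("LOC", "PERSON"): 0.10,
}
alignment = greedy_merge(example_scores, threshold=0.5)
print(alignment)
```

Labels of dataset B that never clear the threshold stay out of the mapping and would simply be kept as separate labels in the merged label space. Because each merge decision is a single ranked pair with a visible score, the resulting alignment is easy to audit by hand.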
Xiaobo Zhang
School of Information Engineering, Jiangxi Vocational College of Finance & Economics, Jiujiang, 332000, Jiangxi, China; School of Computer Sciences, Universiti Sains Malaysia, 11700, Penang, Malaysia.
Congqing He
Universiti Sains Malaysia
Natural Language Processing, Trustworthy AI, Large Language Models, Social Science
Ying He
School of Information Engineering, Jiangxi Vocational College of Finance & Economics, Jiujiang, 332000, Jiangxi, China; School of Computer Sciences, Universiti Sains Malaysia, 11700, Penang, Malaysia.
Jian Peng
School of Information Engineering, Jiangxi Vocational College of Finance & Economics, Jiujiang, 332000, Jiangxi, China.
Dajie Fu
School of Information Engineering, Jiangxi Vocational College of Finance & Economics, Jiujiang, 332000, Jiangxi, China.
Tien-Ping Tan
Universiti Sains Malaysia
automatic speech recognition, Malay, code switching, non-native, speech synthesis