Towards DS-NER: Unveiling and Addressing Latent Noise in Distant Annotations

📅 2025-05-18
🏛️ IEEE Transactions on Knowledge and Data Engineering
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the heterogeneous noise in distantly supervised named entity recognition (DS-NER) arising from diverse labeling sources, e.g., rule-based matching and large language model (LLM) prompting. We first decouple distant supervision noise into two distinct components: the Unlabeled Entity Problem (UEP) and the Noisy Entity Problem (NEP), and model them separately. To this end, we propose a unified noise analysis framework and an end-to-end DS-NER model integrating rule-based supervision, LLM-prompted supervision, dual-task collaborative training, and a noise-aware loss function. Extensive experiments across three data sources, four annotation strategies, and eight real-world distant supervision datasets demonstrate substantial improvements over state-of-the-art methods. Our approach establishes an interpretable and scalable noise-handling paradigm for robust DS-NER modeling.

πŸ“ Abstract
Distantly supervised named entity recognition (DS-NER) has emerged as a cheap and convenient alternative to traditional human annotation, enabling the automatic generation of training data by aligning text with external resources. Despite many efforts on noise measurement, few works examine the latent noise distribution across different distant annotation methods. In this work, we explore the effectiveness and robustness of DS-NER from two aspects: (1) distant annotation techniques, which encompass both traditional rule-based methods and the newer large language model supervision approach, and (2) noise assessment, for which we introduce a novel framework. This framework divides the noise into the unlabeled-entity problem (UEP) and the noisy-entity problem (NEP) and provides a specialized solution for each. Our proposed method achieves significant improvements on eight real-world distant supervision datasets originating from three different data sources and involving four distinct annotation techniques, confirming its superiority over current state-of-the-art methods.
Problem

Research questions and friction points this paper is trying to address.

Exploring latent noise distribution in distant annotation methods for NER
Assessing effectiveness of rule-based vs LLM-based distant annotation techniques
Addressing unlabeled-entity and noisy-entity problems in distant supervision NER
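
To make the two noise types concrete, here is a minimal sketch of rule-based distant annotation by dictionary matching, followed by a comparison against gold spans. It is an illustration under stated assumptions, not the paper's method: the gazetteer, the example sentence, and the helper names `dictionary_annotate` and `noise_breakdown` are all hypothetical, and the sketch assumes that UEP refers to gold entities the distant labels miss entirely, while NEP refers to distant spans with a wrong type, wrong boundary, or no gold counterpart.

```python
from typing import Dict, List, Set, Tuple

# A span is (start token index, end token index exclusive, entity type).
Span = Tuple[int, int, str]


def dictionary_annotate(tokens: List[str],
                        gazetteer: Dict[Tuple[str, ...], str]) -> Set[Span]:
    """Greedy longest-match lookup of token n-grams in a hypothetical gazetteer."""
    spans: Set[Span] = set()
    max_len = max((len(key) for key in gazetteer), default=0)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate n-gram first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in gazetteer:
                spans.add((i, i + n, gazetteer[key]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return spans


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one token position."""
    return a[0] < b[1] and b[0] < a[1]


def noise_breakdown(gold: Set[Span], distant: Set[Span]) -> Dict[str, int]:
    """Count exact matches, missed gold entities (UEP), and wrong distant spans (NEP)."""
    correct = gold & distant
    # Assumed UEP reading: gold entities that no distant span touches at all.
    unlabeled = {g for g in gold if not any(overlaps(g, d) for d in distant)}
    # Assumed NEP reading: distant spans that do not exactly match a gold span.
    noisy = distant - gold
    return {
        "correct": len(correct),
        "unlabeled_uep": len(unlabeled),
        "noisy_nep": len(noisy),
    }


# Toy data, invented for illustration only.
tokens = ["Jordan", "visited", "Washington", "with", "Ada", "Lovelace"]
gazetteer = {("jordan",): "LOC", ("washington",): "LOC"}
gold = {(0, 1, "PER"), (2, 3, "LOC"), (4, 6, "PER")}

distant = dictionary_annotate(tokens, gazetteer)
report = noise_breakdown(gold, distant)
print(report)
```

In this toy sentence the gazetteer tags "Jordan" as a location although the gold label is a person (a noisy entity), and it has no entry for "Ada Lovelace" (an unlabeled entity). Counting the two cases separately is what allows each to be handled by its own strategy.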
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines rule-based and LLM supervision methods
Introduces framework for UEP and NEP categorization
Achieves significant improvements on eight real-world distant supervision datasets