Towards DS-NER: Unveiling and Addressing Latent Noise in Distant Annotations

📅 2025-05-18

🏛️ IEEE Transactions on Knowledge and Data Engineering

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the heterogeneous noise in distantly supervised named entity recognition (DS-NER) arising from diverse labeling sources—e.g., rule-based matching and large language model (LLM) prompting. We first decouple distant supervision noise into two distinct components: the Unlabeled Entity Problem (UEP) and the Noisy Entity Problem (NEP), and model them separately. To this end, we propose a unified noise analysis framework and an end-to-end DS-NER model integrating rule-based supervision, LLM-prompted supervision, dual-task collaborative training, and a noise-aware loss function. Extensive experiments across three data sources, four annotation strategies, and eight real-world distant supervision datasets demonstrate substantial improvements over state-of-the-art methods. Our approach establishes an interpretable and scalable noise-handling paradigm for robust DS-NER modeling.

📝 Abstract

Distantly supervised named entity recognition (DS-NER) has emerged as a cheap and convenient alternative to traditional human annotation, enabling the automatic generation of training data by aligning text with external resources. Despite many efforts on noise measurement, few works examine the latent noise distribution across different distant annotation methods. In this work, we explore the effectiveness and robustness of DS-NER from two aspects: (1) distant annotation techniques, which encompass both traditional rule-based methods and the newer large language model supervision approach, and (2) noise assessment, for which we introduce a novel framework. This framework divides the noise into the unlabeled-entity problem (UEP) and the noisy-entity problem (NEP) and provides a specialized solution for each. Our proposed method achieves significant improvements on eight real-world distant supervision datasets originating from three different data sources and involving four distinct annotation techniques, confirming its superiority over current state-of-the-art methods.
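
To make the two noise types concrete, here is a minimal sketch of rule-based distant annotation by gazetteer lookup. It is an illustration only, not the paper's annotation pipeline: the function `distant_annotate`, the toy `GAZETTEER`, and the example sentence are all hypothetical.

```python
from typing import Dict, List, Tuple

# A span is (start token index, end token index exclusive, entity type).
Span = Tuple[int, int, str]


def distant_annotate(tokens: List[str],
                     gazetteer: Dict[Tuple[str, ...], str]) -> List[Span]:
    """Greedy longest-match lookup of lowercased token n-grams in a gazetteer."""
    max_len = max((len(key) for key in gazetteer), default=0)
    spans: List[Span] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate n-gram first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in gazetteer:
                spans.append((i, i + n, gazetteer[key]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return spans


# Hypothetical toy gazetteer: it knows two locations and no person names.
GAZETTEER = {
    ("washington",): "LOC",
    ("new", "york"): "LOC",
}

tokens = "Washington met Angela Merkel in New York".split()
spans = distant_annotate(tokens, GAZETTEER)
print(spans)
```

In this toy sentence, "Angela Merkel" receives no label because the gazetteer lacks it (an instance of UEP), while "Washington", used here as a person, is tagged `LOC` (an instance of NEP). Only "New York" is labeled correctly.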

Problem

Research questions and friction points this paper is trying to address.

Exploring latent noise distribution in distant annotation methods for NER

Assessing effectiveness of rule-based vs LLM-based distant annotation techniques

Addressing unlabeled-entity and noisy-entity problems in distant supervision NER
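
One simple way to quantify the two problems, assuming a small gold-annotated sample is available, is sketched below. The definitions are assumptions chosen for illustration and are not the paper's noise assessment framework: a gold entity counts toward UEP when no distant span overlaps it, and a distant span counts toward NEP when it is not an exact gold match. The function `noise_rates` and the example spans are hypothetical.

```python
from typing import Set, Tuple

# A span is (start token index, end token index exclusive, entity type).
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """True when the token ranges of two spans intersect."""
    return a[0] < b[1] and b[0] < a[1]


def noise_rates(gold: Set[Span], distant: Set[Span]) -> Tuple[float, float]:
    """Return (uep_rate, nep_rate) under the assumed definitions.

    uep_rate: share of gold entities that no distant span overlaps.
    nep_rate: share of distant spans that are not exact gold matches
              (wrong boundary or wrong type).
    """
    unlabeled = [g for g in gold if not any(overlaps(g, d) for d in distant)]
    noisy = [d for d in distant if d not in gold]
    uep_rate = len(unlabeled) / len(gold) if gold else 0.0
    nep_rate = len(noisy) / len(distant) if distant else 0.0
    return uep_rate, nep_rate


# Hypothetical sentence: "Washington met Angela Merkel in New York"
gold = {(0, 1, "PER"), (2, 4, "PER"), (5, 7, "LOC")}
distant = {(0, 1, "LOC"), (5, 7, "LOC")}

uep_rate, nep_rate = noise_rates(gold, distant)
print(f"UEP rate: {uep_rate:.2f}, NEP rate: {nep_rate:.2f}")
```

Computing the two rates separately for each annotation method (for example, rule-based versus LLM-prompted labels) would show whether a method mostly misses entities or mostly mislabels them.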

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines rule-based and LLM supervision methods

Introduces framework for UEP and NEP categorization

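Achieves improvements over state-of-the-art methods on eight real-world distant supervision datasets spanning three data sources and four annotation techniques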

🔎 Similar Papers

Improving the Robustness of Distantly-Supervised Named Entity Recognition via Uncertainty-Aware Teacher Learning and Student-Student Collaborative Learning