AI Summary
This work addresses the heterogeneous noise in distantly supervised named entity recognition (DS-NER) arising from diverse labeling sources, e.g., rule-based matching and large language model (LLM) prompting. We first decouple distant supervision noise into two distinct components: the Unlabeled Entity Problem (UEP) and the Noisy Entity Problem (NEP), and model them separately. To this end, we propose a unified noise analysis framework and an end-to-end DS-NER model integrating rule-based supervision, LLM-prompted supervision, dual-task collaborative training, and a noise-aware loss function. Extensive experiments across three data sources, four annotation strategies, and eight real-world distant supervision datasets demonstrate substantial improvements over state-of-the-art methods. Our approach establishes an interpretable and scalable noise-handling paradigm for robust DS-NER modeling.
Abstract
Distantly supervised named entity recognition (DS-NER) has emerged as a cheap and convenient alternative to traditional human annotation, enabling the automatic generation of training data by aligning text with external resources. Despite many efforts on noise measurement methods, few works focus on the latent noise distribution across different distant annotation methods. In this work, we explore the effectiveness and robustness of DS-NER from two aspects: (1) distant annotation techniques, which encompass both traditional rule-based methods and the innovative large language model supervision approach, and (2) noise assessment, for which we introduce a novel framework. This framework addresses the noise by dividing it into the unlabeled-entity problem (UEP) and the noisy-entity problem (NEP), and provides a specialized solution for each. Our proposed method achieves significant improvements on eight real-world distant supervision datasets originating from three different data sources and involving four distinct annotation techniques, confirming its superiority over current state-of-the-art methods.
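To make the UEP/NEP distinction concrete, the following is a minimal sketch of how the two noise types could be counted when gold annotations are available for comparison. It is not the paper's framework: the span representation, the `split_noise` helper, and the exact criteria (a gold entity with no overlapping distant label counts as unlabeled; a distant label that does not exactly match any gold entity counts as noisy) are assumptions chosen for illustration.

```python
from typing import Dict, Set, Tuple

# Hypothetical span format: (start, end_exclusive, entity_type).
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the token ranges of two spans intersect."""
    return a[0] < b[1] and b[0] < a[1]


def split_noise(gold: Set[Span], distant: Set[Span]) -> Dict[str, Set[Span]]:
    """Separate distant-label errors into two illustrative buckets.

    unlabeled: gold entities that no distant label touches (UEP-style misses).
    noisy:     distant labels that do not exactly match a gold entity,
               i.e. wrong boundary, wrong type, or spurious (NEP-style errors).
    """
    unlabeled = {g for g in gold if not any(overlaps(g, d) for d in distant)}
    noisy = {d for d in distant if d not in gold}
    return {"unlabeled": unlabeled, "noisy": noisy}


# Toy example with made-up spans.
gold = {(0, 2, "PER"), (5, 6, "ORG"), (8, 10, "LOC")}
distant = {(0, 2, "PER"), (8, 9, "LOC"), (12, 13, "ORG")}
result = split_noise(gold, distant)
```

In this toy case the missed `ORG` entity at tokens 5 to 6 falls into the unlabeled bucket, while the truncated `LOC` span and the spurious `ORG` span fall into the noisy bucket. Counting the two buckets separately per annotation source is one simple way to see that, for instance, dictionary matching and LLM prompting may fail in different proportions.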