DynClean: Training Dynamics-based Label Cleaning for Distantly-Supervised Named Entity Recognition

📅 2025-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Distantly supervised named entity recognition (DS-NER) suffers from severe label noise due to inaccurate and incomplete distant supervision annotations. While existing approaches primarily focus on robust modeling, they largely overlook label cleaning—a critical yet underexplored direction. This paper introduces training dynamics—the evolution of per-sample loss over training epochs—to DS-NER for the first time, proposing a fully automated, intervention-free dynamic label cleaning framework. By modeling loss trajectories, the method characterizes how the model learns from each annotated instance; an adaptive threshold estimation strategy is then designed to precisely identify and remove mislabeled samples. Evaluated on four standard benchmarks, our approach achieves consistent F1 improvements of 3.18%–8.95%, substantially outperforming state-of-the-art DS-NER methods. This work establishes a novel paradigm for learning with noisy labels in NER, shifting emphasis from robustness under noise to proactive noise mitigation through dynamic label refinement.
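The core idea above can be sketched in a few lines: record each annotated sample's training loss at every epoch, summarize the trajectory, and flag samples whose loss stays high. The thresholding heuristic below (mean plus one standard deviation of per-sample mean losses) is a hypothetical stand-in for the paper's automatic threshold estimation strategy, which is not specified in this summary; `flag_noisy_labels` and its signature are illustrative names, not the authors' code.

```python
import numpy as np

def flag_noisy_labels(loss_history):
    """Flag likely-mislabeled samples from per-sample loss trajectories.

    loss_history: array of shape (num_epochs, num_samples) holding each
    annotated sample's training loss at every epoch.

    Heuristic (an assumption, not the paper's exact rule): correctly
    labeled samples are learned quickly, so their loss decays; mislabeled
    samples resist fitting and keep a high average loss. The cutoff is
    estimated from the data itself rather than hand-tuned.
    """
    mean_loss = loss_history.mean(axis=0)            # average loss per sample
    threshold = mean_loss.mean() + mean_loss.std()   # simple adaptive cutoff
    return mean_loss > threshold                     # True = likely mislabeled
```

Samples flagged this way would then simply be removed from the distant-supervision training set before retraining, mirroring the "direct removal" step described above.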

📝 Abstract
Distantly Supervised Named Entity Recognition (DS-NER) has attracted attention due to its scalability and ability to automatically generate labeled data. However, distant annotation introduces many mislabeled instances, limiting performance. Most existing work attempts to solve this problem by developing intricate models that learn from the noisy labels. An alternative is to clean the labeled data, thereby increasing the quality of the distant labels; this approach has received little attention for NER. In this paper, we propose a training dynamics-based label cleaning approach, which leverages the behavior of a model as training progresses to characterize the distantly annotated samples. We also introduce an automatic threshold estimation strategy to locate the errors in distant labels. Extensive experimental results demonstrate that: (1) models trained on our cleaned DS-NER datasets, refined by directly removing identified erroneous annotations, achieve significant improvements in F1-score, ranging from 3.18% to 8.95%; and (2) our method outperforms numerous advanced DS-NER approaches across four datasets.
Problem

Research questions and friction points this paper is trying to address.

Improving label quality in distantly-supervised NER
Reducing mislabeled instances in automated annotations
Enhancing model performance via training dynamics analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training dynamics-based label cleaning approach
Automatic threshold estimation for error detection
Direct removal of identified erroneous annotations