Data-Constrained Synthesis of Training Data for De-Identification

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
In privacy-sensitive domains such as clinical NLP, scarce access to real-world annotated data hinders the development of robust de-identification systems. Method: domain-adapted large language models (LLMs) generate clinical text, and encoder-based NER models (e.g., BERT-CRF) automatically annotate personally identifiable information (PII) entities, yielding synthetic corpora for training de-identification NER models. Contribution/Results: machine-annotation quality, not synthetic data scale, determines the upper bound of downstream NER performance, and only minimal real data is needed for effective LLM domain adaptation. Ablation studies on both Swedish and Spanish data show that NER models trained on synthetic data come close to real-data baselines, with annotation quality the dominant factor. The work offers an efficient, controllable, and reproducible data-substitution framework for privacy-constrained settings.
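The generate-then-annotate pipeline in the summary can be sketched in a few lines. This is a minimal illustration, not the paper's code: the domain-adapted LLM and the BERT-CRF annotator are stubbed out with placeholder functions (a template sampler and a toy regex date tagger), and all names are hypothetical.

```python
# Minimal sketch of the generate-then-annotate pipeline: an LLM stand-in
# produces clinical text, an NER stand-in assigns BIO tags, and the pair
# becomes one synthetic training example for a de-identification model.
import re

def generate_synthetic_note(seed: int) -> str:
    """Placeholder for a domain-adapted LLM sampling a clinical note."""
    templates = [
        "Patient Anna Berg seen on 2024-03-01 for follow-up.",
        "Dr. Lopez reviewed labs for Erik Svensson on 2023-11-12.",
    ]
    return templates[seed % len(templates)]

def machine_annotate(text: str) -> list[tuple[str, str]]:
    """Placeholder for an encoder-based NER annotator (e.g., BERT-CRF).
    Here a toy regex marks dates only; a real system tags all PII classes."""
    tagged = []
    for token in text.split():
        if re.fullmatch(r"\d{4}-\d{2}-\d{2}\.?", token):
            tagged.append((token, "B-DATE"))
        else:
            tagged.append((token, "O"))
    return tagged

# Build a tiny synthetic corpus: generated text plus machine-made labels.
corpus = [machine_annotate(generate_synthetic_note(i)) for i in range(2)]
```

A downstream de-identification NER model would then be trained on `corpus` exactly as it would on human-annotated data, which is why the annotator's quality caps what the trained model can learn.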

📝 Abstract
Many sensitive domains -- such as the clinical domain -- lack widely available datasets due to privacy risks. The increasing generative capabilities of large language models (LLMs) have made synthetic datasets a viable path forward. In this study, we domain-adapt LLMs to the clinical domain and generate synthetic clinical texts that are machine-annotated with tags for personally identifiable information using capable encoder-based NER models. The synthetic corpora are then used to train synthetic NER models. The results show that training NER models using synthetic corpora incurs only a small drop in predictive performance. The limits of this process are investigated in a systematic ablation study -- using both Swedish and Spanish data. Our analysis shows that smaller datasets can be sufficient for domain-adapting LLMs for data synthesis. Instead, the effectiveness of this process is almost entirely contingent on the performance of the machine-annotating NER models trained using the original data.
Problem

Research questions and friction points this paper is trying to address.

Can domain-adapted LLMs generate clinical texts usable as training data despite scarce real data?
Can de-identification NER models be trained on machine-annotated synthetic corpora alone?
How large is the performance drop relative to real data, and which stage of the pipeline limits it?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain-adapted LLMs generate synthetic clinical text from only a small amount of real data
Encoder-based NER models machine-annotate PII tags, making annotation quality the controllable factor
NER models trained on synthetic corpora nearly match real-data baselines