De-identification is not enough: a comparison between de-identified and synthetic clinical notes

📅 2024-01-31

🏛️ Scientific Reports

📈 Citations: 5

✨ Influential: 0

🤖 AI Summary

This study investigates whether de-identification suffices for protecting privacy in clinical text and evaluates synthetic data as a viable alternative. We conduct the first systematic comparison between de-identified clinical notes and large language model-generated synthetic clinical notes enhanced with differential privacy, jointly assessing privacy preservation and downstream utility. We propose a two-dimensional evaluation framework grounded in real-world re-identification attack success rates and NLP task performance. Results demonstrate that de-identification remains vulnerable to re-identification and suffers from low semantic fidelity. In contrast, synthetic notes reduce re-identification rates to below 0.5% while achieving an F1 score of 89.2% on clinical named entity recognition, significantly outperforming de-identified counterparts (73.6%). Thus, differentially private synthetic data simultaneously delivers strong privacy guarantees and high task utility, offering a robust alternative to conventional de-identification.
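For readers unfamiliar with the utility metric, the sketch below shows how an entity-level F1 score for named entity recognition is commonly computed. The summary does not say which scoring variant was used, so exact-match micro F1 over `(start, end, label)` spans is an assumption here, and the function name `entity_f1` and the toy entities are purely illustrative.

```python
from typing import Set, Tuple

# An entity is a (start_token, end_token, label) triple.
# Exact-match scoring is assumed: span and label must both agree.
Entity = Tuple[int, int, str]


def entity_f1(gold: Set[Entity], pred: Set[Entity]) -> float:
    """Return exact-match entity-level F1 between gold and predicted entities."""
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical labels, not data from the paper).
gold = {(0, 2, "PROBLEM"), (5, 6, "TREATMENT"), (9, 11, "TEST")}
pred = {(0, 2, "PROBLEM"), (5, 6, "TEST"), (9, 11, "TEST")}

# Two of three predictions match, so precision = recall = 2/3.
score = entity_f1(gold, pred)
```

Under this scoring, a prediction with the right span but the wrong label counts as both a false positive and a false negative, which is why the middle entity above lowers precision and recall together.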

Problem

Research questions and friction points this paper is trying to address.

De-identification alone is insufficient against membership inference attacks.

Synthetic clinical notes are evaluated for both privacy and downstream task performance.

Trade-offs between synthetic and real clinical notes are explored.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Used large language models to generate synthetic clinical notes.

Evaluated synthetic notes on clinical tasks.

Proposed a membership inference attack on synthetic data.
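To make the idea of a membership inference attack on synthetic data concrete, here is a minimal sketch. It is not the attack proposed in the paper, whose details are not given on this page. It assumes a simple threshold attack: a candidate real note is guessed to be a training member if its best token-level Jaccard overlap with any synthetic note meets a threshold. All function names, the threshold, and the toy notes are hypothetical.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two whitespace-tokenized notes."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def membership_score(candidate: str, synthetic: List[str]) -> float:
    """Highest similarity between the candidate and any synthetic note."""
    return max((jaccard(candidate, s) for s in synthetic), default=0.0)


def attack_accuracy(members: List[str], non_members: List[str],
                    synthetic: List[str], threshold: float) -> float:
    """Fraction of notes whose membership the threshold attack guesses correctly."""
    correct = sum(membership_score(m, synthetic) >= threshold for m in members)
    correct += sum(membership_score(n, synthetic) < threshold for n in non_members)
    return correct / (len(members) + len(non_members))


# Toy notes (invented for illustration only).
synthetic = ["patient reports chest pain and shortness of breath"]
members = ["patient reports chest pain and mild shortness of breath"]
non_members = ["follow up visit for knee replacement surgery"]

accuracy = attack_accuracy(members, non_members, synthetic, threshold=0.5)
```

An attack accuracy near 0.5 on balanced member and non-member sets would suggest the synthetic notes leak little about which real notes were used for training; values well above that indicate leakage.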
