Medical Image De-Identification Resources: Synthetic DICOM Data and Tools for Validation

📅 2025-08-03
🤖 AI Summary
Sharing medical images involves a trade-off between de-identifying DICOM files to protect privacy and preserving their scientific utility, and the lack of objective, quantitative evaluation tools makes that trade-off hard to manage. To address this, we introduce MIDI, a synthetic, ground-truth-annotated DICOM dataset containing 53,581 instances from 538 subjects, spanning multiple vendors, imaging modalities, and cancer types. Alongside it, we provide openly available evaluation tools (a Python script, answer keys, and mapping files) that enable standardized, reproducible, and quantitative assessment of de-identification. The framework is aligned with the HIPAA Privacy Rule "Safe Harbor" method, DICOM PS3.15 Confidentiality Profiles, and TCIA best practices. The resource combines realistic DICOM structure, synthetically generated protected health information (PHI) and personally identifiable information (PII), and ground-truth labels, offering an objective alternative to the subjective, manual reviews that most current assessments rely on. It is intended to support regulatory compliance, data security, and research reuse in medical image sharing.

📝 Abstract
Medical imaging research increasingly depends on large-scale data sharing to promote reproducibility and train Artificial Intelligence (AI) models. Ensuring patient privacy remains a significant challenge for open-access data sharing. Digital Imaging and Communications in Medicine (DICOM), the global standard data format for medical imaging, encodes both essential clinical metadata and extensive protected health information (PHI) and personally identifiable information (PII). Effective de-identification must remove identifiers, preserve scientific utility, and maintain DICOM validity. Tools exist to perform de-identification, but few assess its effectiveness, and most rely on subjective reviews, limiting reproducibility and regulatory confidence. To address this gap, we developed an openly accessible DICOM dataset infused with synthetic PHI/PII and an evaluation framework for benchmarking image de-identification workflows. The Medical Image de-identification (MIDI) dataset was built using publicly available de-identified data from The Cancer Imaging Archive (TCIA). It includes 538 subjects (216 for validation, 322 for testing), 605 studies, 708 series, and 53,581 DICOM image instances. These span multiple vendors, imaging modalities, and cancer types. Synthetic PHI and PII were embedded into structured data elements, plain text data elements, and pixel data to simulate real-world identity leaks encountered by TCIA curation teams. Accompanying evaluation tools include a Python script, answer keys (known truth), and mapping files that enable automated comparison of curated data against expected transformations. The framework is aligned with the HIPAA Privacy Rule "Safe Harbor" method, DICOM PS3.15 Confidentiality Profiles, and TCIA best practices. It supports objective, standards-driven evaluation of de-identification workflows, promoting safer and more consistent medical image sharing.
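
The abstract describes answer keys and mapping files that allow curated data to be compared automatically against expected transformations. The sketch below illustrates one way such a comparison could work. It is not the MIDI evaluation script: the `KeyEntry` record, the three action names, and the dictionary-based stand-ins for DICOM instances and the UID mapping file are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class KeyEntry:
    """One hypothetical answer-key row: what should happen to one element."""
    original_uid: str            # SOP Instance UID before de-identification
    keyword: str                 # DICOM attribute keyword, e.g. "PatientName"
    action: str                  # "remove", "replace", or "keep" (assumed names)
    expected: Optional[str] = None


def check_entry(entry: KeyEntry,
                uid_map: Dict[str, str],
                curated: Dict[str, Dict[str, str]]) -> bool:
    """Return True if the curated instance matches the expected transformation."""
    new_uid = uid_map.get(entry.original_uid)
    if new_uid is None or new_uid not in curated:
        return False  # instance cannot be located, so the check fails
    value = curated[new_uid].get(entry.keyword)
    if entry.action == "remove":
        return value in (None, "")
    if entry.action in ("replace", "keep"):
        return value == entry.expected
    raise ValueError(f"unknown action: {entry.action!r}")


def score(answer_key: List[KeyEntry],
          uid_map: Dict[str, str],
          curated: Dict[str, Dict[str, str]]) -> dict:
    """Summarize how many answer-key checks the curated data passes."""
    failures = [e for e in answer_key if not check_entry(e, uid_map, curated)]
    return {
        "total": len(answer_key),
        "passed": len(answer_key) - len(failures),
        "failures": failures,
    }


# Toy data: the tool kept Modality and replaced PatientID, but left the name.
answer_key = [
    KeyEntry("1.2.3", "PatientName", "remove"),
    KeyEntry("1.2.3", "Modality", "keep", "CT"),
    KeyEntry("1.2.3", "PatientID", "replace", "MIDI-0001"),
]
uid_map = {"1.2.3": "9.9.9"}  # stand-in for a mapping file
curated = {"9.9.9": {"PatientName": "DOE^JANE",
                     "Modality": "CT",
                     "PatientID": "MIDI-0001"}}

report = score(answer_key, uid_map, curated)
print(f"{report['passed']}/{report['total']} checks passed")
for entry in report["failures"]:
    print("failed:", entry.keyword, entry.action)
```

Scoring against an explicit key gives each de-identification run a reproducible pass/fail count per element, in contrast to the subjective review the abstract describes.
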
Problem

Research questions and friction points this paper is trying to address.

Ensuring patient privacy in medical image sharing
Validating effectiveness of DICOM de-identification tools
Lack of standardized evaluation for PHI/PII removal
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic DICOM dataset with embedded PHI/PII
Automated evaluation framework for de-identification
Alignment with HIPAA and DICOM standards
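
To make the first contribution concrete, the following minimal sketch shows how synthetic PHI might be seeded into structured and free-text elements while recording an answer key at the same time. It assumes a plain dictionary in place of a real DICOM dataset and does not cover pixel data. The function names and the answer-key layout are hypothetical and are not taken from the MIDI tooling.

```python
from typing import Dict, List, Tuple


def inject_synthetic_phi(elements: Dict[str, str],
                         structured: Dict[str, str],
                         free_text: Dict[str, str]
                         ) -> Tuple[Dict[str, str], List[dict]]:
    """Return a seeded copy of `elements` plus answer-key rows for each insertion."""
    seeded = dict(elements)  # leave the clean source untouched
    answer_key: List[dict] = []
    # Structured elements: the synthetic value becomes the whole element value.
    for keyword, fake in structured.items():
        seeded[keyword] = fake
        answer_key.append({"keyword": keyword, "phi": fake, "kind": "structured"})
    # Free-text elements: the synthetic value is appended to existing text.
    for keyword, fake in free_text.items():
        seeded[keyword] = f"{seeded.get(keyword, '')} {fake}".strip()
        answer_key.append({"keyword": keyword, "phi": fake, "kind": "free_text"})
    return seeded, answer_key


def find_leaks(answer_key: List[dict], curated: Dict[str, str]) -> List[dict]:
    """List answer-key rows whose synthetic PHI still appears after curation."""
    return [row for row in answer_key
            if row["phi"] in str(curated.get(row["keyword"], ""))]


clean = {"Modality": "MR", "StudyDescription": "BRAIN W/WO CONTRAST"}
seeded, key = inject_synthetic_phi(
    clean,
    structured={"PatientName": "SYNTHETIC^PERSON"},
    free_text={"StudyDescription": "reviewed by Dr. Synthetic Person"},
)

# A hypothetical tool that drops PatientName but ignores free text.
curated = {k: v for k, v in seeded.items() if k != "PatientName"}
leaks = find_leaks(key, curated)
print([row["keyword"] for row in leaks])
```

Because every inserted value is recorded when it is seeded, any value that survives curation can be traced back to a specific element. This is the property that makes a ground-truth dataset usable as a benchmark.
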
Michael W. Rutherford
University of Arkansas for Medical Sciences, Little Rock, Arkansas, USA
Tracy Nolan
University of Arkansas for Medical Sciences, Little Rock, Arkansas, USA
Linmin Pei
Frederick National Laboratory for Cancer Research, Frederick, Maryland, USA
Ulrike Wagner
Frederick National Laboratory for Cancer Research, Frederick, Maryland, USA
Qinyan Pan
Ellumen, Inc., Silver Spring, MD, USA
Phillip Farmer
University of Arkansas for Medical Sciences, Little Rock, Arkansas, USA
Kirk Smith
University of Arkansas for Medical Sciences, Little Rock, Arkansas, USA
Benjamin Kopchick
Deloitte Consulting LLP, New York, NY, USA
Laura Opsahl-Ong
Deloitte Consulting LLP, New York, NY, USA
Granger Sutton
National Cancer Institute, National Institutes of Health (NIH), Bethesda, MD, USA
David Clunie
Pixelmed Publishing, Bangor, PA, USA
Keyvan Farahani
Senior Data Science, Imaging and AI Program Director, NHLBI, NIH
Imaging AI & Image-guided interventions
Fred Prior
Distinguished Professor and Chair, Department of Biomedical Informatics, University of Arkansas for
quantitative imaging, informatics