Medical Image De-Identification Resources: Synthetic DICOM Data and Tools for Validation

📅 2025-08-03

🤖 AI Summary

Sharing medical images requires de-identifying DICOM files without destroying their scientific utility, yet there are few objective, quantitative tools for checking how well a de-identification workflow performs. This work introduces MIDI, a ground-truth-annotated DICOM dataset built from publicly available, de-identified TCIA data into which synthetic protected health information (PHI) and personally identifiable information (PII) were embedded. It contains 53,581 image instances from 538 subjects and spans multiple vendors, imaging modalities, and cancer types. The accompanying evaluation tools (a Python script, answer keys, and mapping files) automate the comparison of de-identified output against the expected transformations, enabling standardized, reproducible, quantitative assessment. The framework is aligned with the HIPAA Privacy Rule Safe Harbor method, DICOM PS3.15 Confidentiality Profiles, and TCIA best practices, and offers an alternative to the subjective, manual reviews that most current evaluations rely on.

📝 Abstract

Medical imaging research increasingly depends on large-scale data sharing to promote reproducibility and train Artificial Intelligence (AI) models. Ensuring patient privacy remains a significant challenge for open-access data sharing. Digital Imaging and Communications in Medicine (DICOM), the global standard data format for medical imaging, encodes both essential clinical metadata and extensive protected health information (PHI) and personally identifiable information (PII). Effective de-identification must remove identifiers, preserve scientific utility, and maintain DICOM validity. Tools exist to perform de-identification, but few assess its effectiveness, and most rely on subjective reviews, limiting reproducibility and regulatory confidence. To address this gap, we developed an openly accessible DICOM dataset infused with synthetic PHI/PII and an evaluation framework for benchmarking image de-identification workflows. The Medical Image de-identification (MIDI) dataset was built using publicly available de-identified data from The Cancer Imaging Archive (TCIA). It includes 538 subjects (216 for validation, 322 for testing), 605 studies, 708 series, and 53,581 DICOM image instances. These span multiple vendors, imaging modalities, and cancer types. Synthetic PHI and PII were embedded into structured data elements, plain text data elements, and pixel data to simulate real-world identity leaks encountered by TCIA curation teams. Accompanying evaluation tools include a Python script, answer keys (known truth), and mapping files that enable automated comparison of curated data against expected transformations. The framework is aligned with the HIPAA Privacy Rule "Safe Harbor" method, DICOM PS3.15 Confidentiality Profiles, and TCIA best practices. It supports objective, standards-driven evaluation of de-identification workflows, promoting safer and more consistent medical image sharing.
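The automated comparison described above can be pictured with a minimal sketch. It is not the MIDI evaluation script: the answer-key layout, the action names (`removed`, `replaced`, `retained`), and the `score` function are all hypothetical, chosen only to illustrate checking a curated header against expected transformations. Headers are modeled as plain dictionaries keyed by DICOM keyword so the example needs nothing beyond the standard library.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical action names; the real MIDI answer-key schema may differ.
REMOVED = "removed"    # element must be absent or empty after curation
REPLACED = "replaced"  # element must equal a known substitute value
RETAINED = "retained"  # element must keep its original value


@dataclass
class Expectation:
    action: str
    expected: Optional[str] = None  # required for REPLACED and RETAINED


def score(curated: Dict[str, str], answer_key: Dict[str, Expectation]) -> dict:
    """Compare one curated header against its answer key (illustrative only)."""
    failed = []
    for keyword, exp in answer_key.items():
        value = curated.get(keyword)
        if exp.action == REMOVED:
            ok = value in (None, "")
        else:
            # REPLACED and RETAINED both reduce to an equality check.
            ok = value == exp.expected
        if not ok:
            failed.append(keyword)
    total = len(answer_key)
    pass_rate = (total - len(failed)) / total if total else 1.0
    return {"total": total, "failed": failed, "pass_rate": pass_rate}


# Synthetic example: a de-identification tool left InstitutionName in place.
curated_header = {
    "PatientName": "",
    "PatientID": "MIDI-0001",
    "StudyDate": "20010203",
    "Modality": "CT",
    "InstitutionName": "General Hospital",
}
answer_key = {
    "PatientName": Expectation(REMOVED),
    "PatientID": Expectation(REPLACED, "MIDI-0001"),
    "StudyDate": Expectation(REPLACED, "20010203"),
    "Modality": Expectation(RETAINED, "CT"),
    "InstitutionName": Expectation(REMOVED),
}

result = score(curated_header, answer_key)
print(result["failed"], result["pass_rate"])
```

In this toy case only `InstitutionName` fails, because the answer key expects it to be removed. A real evaluation would also have to cover free-text elements and text burned into pixel data, which a keyword-level comparison like this one does not address.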

Problem

Research questions and friction points this paper is trying to address.

Ensuring patient privacy in medical image sharing

Validating effectiveness of DICOM de-identification tools

Lack of standardized evaluation for PHI/PII removal

Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic DICOM dataset with embedded PHI/PII

Automated evaluation framework for de-identification

Alignment with HIPAA and DICOM standards
