🤖 AI Summary
Sharing medical images requires de-identifying DICOM files without destroying their scientific utility, yet few objective, quantitative tools exist for evaluating how well de-identification works. To address this, we introduce MIDI, a ground-truth-annotated DICOM dataset built from publicly available, de-identified TCIA data and infused with synthetic protected health information (PHI) and personally identifiable information (PII). It contains 53,581 instances from 538 subjects and spans multiple vendors, imaging modalities, and cancer types. An accompanying, openly accessible evaluation framework (a Python script, answer keys, and mapping files) enables standardized, reproducible, quantitative assessment of de-identification workflows, and is aligned with the HIPAA Safe Harbor method, DICOM PS3.15 Confidentiality Profiles, and TCIA best practices. By combining realistic DICOM structure, synthetic PHI/PII, and known-truth labels, this work offers an objective alternative to the subjective, manual reviews on which most current evaluations rely.
📝 Abstract
Medical imaging research increasingly depends on large-scale data sharing to promote reproducibility and train Artificial Intelligence (AI) models. Ensuring patient privacy remains a significant challenge for open-access data sharing. Digital Imaging and Communications in Medicine (DICOM), the global standard data format for medical imaging, encodes both essential clinical metadata and extensive protected health information (PHI) and personally identifiable information (PII). Effective de-identification must remove identifiers, preserve scientific utility, and maintain DICOM validity. Tools exist to perform de-identification, but few assess its effectiveness, and most rely on subjective reviews, limiting reproducibility and regulatory confidence. To address this gap, we developed an openly accessible DICOM dataset infused with synthetic PHI/PII and an evaluation framework for benchmarking image de-identification workflows. The Medical Image de-identification (MIDI) dataset was built using publicly available de-identified data from The Cancer Imaging Archive (TCIA). It includes 538 subjects (216 for validation, 322 for testing), 605 studies, 708 series, and 53,581 DICOM image instances. These span multiple vendors, imaging modalities, and cancer types. Synthetic PHI and PII were embedded into structured data elements, plain text data elements, and pixel data to simulate real-world identity leaks encountered by TCIA curation teams. Accompanying evaluation tools include a Python script, answer keys (known truth), and mapping files that enable automated comparison of curated data against expected transformations. The framework is aligned with the HIPAA Privacy Rule "Safe Harbor" method, DICOM PS3.15 Confidentiality Profiles, and TCIA best practices. It supports objective, standards-driven evaluation of de-identification workflows, promoting safer and more consistent medical image sharing.
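To make the idea of automated comparison against an answer key concrete, here is a minimal Python sketch. It is not the MIDI evaluation script: the `Expectation` record, the action labels (`removed`, `retained`, `replaced`), and the dictionary representation of a curated DICOM header are all assumptions chosen for illustration, and the actual answer key and mapping file formats may differ.

```python
from dataclasses import dataclass

# Hypothetical action labels; the real answer key may use another vocabulary.
REMOVED = "removed"    # attribute must be absent or empty after curation
RETAINED = "retained"  # attribute must keep its original value (scientific utility)
REPLACED = "replaced"  # attribute must hold a new, non-empty value


@dataclass(frozen=True)
class Expectation:
    """One answer-key entry for a single DICOM attribute (illustrative only)."""
    tag: str       # attribute keyword, e.g. "PatientName"
    action: str    # one of REMOVED, RETAINED, REPLACED
    original: str  # value before de-identification


def check(exp: Expectation, curated: dict) -> bool:
    """Return True if the curated header satisfies one expectation."""
    value = curated.get(exp.tag)
    if exp.action == REMOVED:
        return value in (None, "")
    if exp.action == RETAINED:
        return value == exp.original
    if exp.action == REPLACED:
        return value not in (None, "", exp.original)
    raise ValueError(f"unknown action: {exp.action!r}")


def score(answer_key: list, curated: dict) -> dict:
    """Compare a curated header against the answer key and summarize results."""
    failures = [exp.tag for exp in answer_key if not check(exp, curated)]
    return {
        "passed": len(answer_key) - len(failures),
        "total": len(answer_key),
        "failures": failures,
    }


# Synthetic example values, not taken from the MIDI dataset.
answer_key = [
    Expectation("PatientName", REPLACED, "DOE^JANE"),
    Expectation("PatientBirthDate", REMOVED, "19700101"),
    Expectation("Modality", RETAINED, "CT"),
    Expectation("StudyDescription", REMOVED, "Chest CT for Jane Doe"),
]
curated = {
    "PatientName": "SUBJECT-0001",
    "Modality": "CT",
    "StudyDescription": "Chest CT for Jane Doe",  # PHI leaked in free text
}
result = score(answer_key, curated)
print(result)
```

In this toy case the free-text `StudyDescription` still carries the patient name, so it is reported as a failure while the other three attributes pass. A real evaluation would also need to cover pixel data and map original identifiers to their curated replacements, which this sketch does not attempt.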