🤖 AI Summary
To address privacy leakage risks when sharing medical images as DICOM files, this paper proposes a hybrid, fine-grained de-identification framework that combines AI with rules. Structured DICOM fields are processed by an interpretable rule engine, grounded in the TCIA guidelines and extended with private tag management. Free-text fields are analyzed by a fine-tuned RoBERTa model for PII/PHI detection, and text embedded in images is extracted using PaddleOCR. DICOM standard compliance is checked after de-identification with the dciodvfy validator. The key contribution is the first dedicated application of RoBERTa to sensitive information identification in unstructured clinical text, which decouples the processing of structured and unstructured data so that the model is never applied to structured fields. Evaluated on the MIDI-B test dataset, the framework achieves 99.91% de-identification accuracy while keeping the rule-based path interpretable and aligned with the TCIA guidelines.
📝 Abstract
De-identifying medical imaging data is a critical step in enabling safe data sharing. This paper presents a hybrid de-identification framework designed to process Digital Imaging and Communications in Medicine (DICOM) files. Our framework adapts a pre-built rule-based component, updated to follow The Cancer Imaging Archive (TCIA) best-practice guidelines, as outlined in DICOM PS 3.15, for improved performance. It incorporates PaddleOCR, a robust Optical Character Recognition (OCR) system for extracting text from images, and a fine-tuned RoBERTa transformer model for identifying and removing Personally Identifiable Information (PII) and Protected Health Information (PHI). Initially, the transformer-based model and the rule-based component were applied together to both structured data and free text. However, this coarse-grained approach did not yield optimal results. To improve performance, we refined the approach by applying the transformer model exclusively to free text, while structured data was handled only by rule-based methods. The DICOM validator dciodvfy was used to verify the integrity of DICOM files after de-identification. Through iterative refinement, including the incorporation of custom rules and private tag handling, the framework achieved a de-identification accuracy of 99.91% on the MIDI-B test dataset. The results demonstrate the effectiveness of combining rule-based compliance with AI-enabled adaptability in addressing the complex challenges of DICOM de-identification.
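The fine-grained routing described in the abstract (rules for structured fields, a model only for free text) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the record is a plain dictionary rather than a real DICOM dataset, the rule table `STRUCTURED_RULES` and the field set `FREE_TEXT_FIELDS` are hypothetical, and `stub_detector` is a regular-expression stand-in for the fine-tuned RoBERTa model. OCR and dciodvfy validation are left out.

```python
import re
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]

# Hypothetical rule table (field name -> action). This is NOT the TCIA or
# DICOM PS 3.15 profile; it only illustrates the shape of a rule engine.
STRUCTURED_RULES: Dict[str, str] = {
    "PatientName": "replace",
    "PatientID": "replace",
    "PatientBirthDate": "remove",
    "Modality": "keep",
}

# Fields treated as free text in this sketch (an assumption for illustration).
FREE_TEXT_FIELDS = {"StudyDescription", "ImageComments"}


def stub_detector(text: str) -> List[Span]:
    """Stand-in for the PII/PHI model: flags ISO dates and 'Dr. <Name>'."""
    pattern = re.compile(r"\d{4}-\d{2}-\d{2}|Dr\.\s+[A-Z][a-z]+")
    return [m.span() for m in pattern.finditer(text)]


def redact(text: str, spans: List[Span], mask: str = "[REDACTED]") -> str:
    """Replace each detected character span with a mask token."""
    pieces: List[str] = []
    cursor = 0
    for start, end in sorted(spans):
        pieces.append(text[cursor:start])
        pieces.append(mask)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def deidentify(
    record: Dict[str, str],
    detector: Callable[[str], List[Span]] = stub_detector,
) -> Dict[str, str]:
    """Route free-text fields to the detector and all other fields to rules."""
    result: Dict[str, str] = {}
    for key, value in record.items():
        if key in FREE_TEXT_FIELDS:
            result[key] = redact(value, detector(value))
            continue
        # Fields without a rule are removed; a conservative default chosen
        # for this sketch, not a behavior stated in the paper.
        action = STRUCTURED_RULES.get(key, "remove")
        if action == "keep":
            result[key] = value
        elif action == "replace":
            result[key] = "ANONYMIZED"
        # action == "remove": the field is omitted from the output
    return result
```

Because the detector is passed in as a callable, a real token-classification model could be substituted for `stub_detector` without changing the routing logic, which mirrors the decoupling the abstract argues for.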