🤖 AI Summary
Machine unlearning for clinical disease reasoning is difficult in multi-label medical settings, both because the efficacy of existing methods there is unclear and because realistic evaluation benchmarks are missing. This work addresses the gap by introducing REMEDI, the first machine unlearning benchmark grounded in the real-world clinical database MIMIC-III. The benchmark is explicitly designed to reflect key medical characteristics, longitudinal data structures, safety constraints, and the complexities of multi-label classification, and it covers diverse unlearning scenarios. A systematic evaluation of multiple existing unlearning algorithms reveals a pronounced trade-off between utility preservation and completeness of forgetting, with most methods proving poorly suited to multi-label clinical tasks. The benchmark is publicly released to foster reproducible, clinically oriented research in machine unlearning.
📝 Abstract
Language models for clinical disease inference are trained on patient data, which may include sensitive and private information, and data owners may request the removal of their data from a trained model because of privacy or copyright concerns. However, exactly unlearning patient-specific data is intractable, and retraining after removing only a small amount of data is resource-intensive. Although several machine unlearning methods exist, their use has generally been restricted to non-medical domains. Moreover, existing benchmarks for evaluating such methods rely primarily on synthetically curated datasets, which are not representative of real-world systems. Hence, the effectiveness of these unlearning methods in the medical domain remains largely unclear. To this end, we introduce REMEDI, an extensive benchmark for machine unlearning tailored to multi-label and multi-class clinical disease inference, where label correlations, longitudinal structure, and safety constraints make unlearning particularly challenging. Unlike existing benchmarks, REMEDI considers: (1) a relevant application domain (medical), (2) comprehensive unlearning setups involving diverse sets of forget instances, (3) challenging unlearning scenarios including multi-label and multi-class classification tasks, and (4) evaluation metrics that measure both utility and the extent of unlearning achieved. REMEDI is built on the MIMIC-III clinical database, which contains comprehensive clinical data of patients. Experiments with existing unlearning methods indicate a trade-off between utility and unlearning performance, and show that these methods are largely unsuited to multi-label classification tasks. To facilitate reproducibility, we make our benchmark publicly available.
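As a rough illustration of what measuring "utility" and "extent of unlearning" can look like in a multi-label setting, the sketch below computes micro-averaged F1 separately on a retain set and a forget set. This is not REMEDI's evaluation protocol, which the abstract does not specify; the function names, the choice of micro-F1, and the toy diagnosis labels are all assumptions made for illustration. Low F1 on the forget set is only a coarse proxy for forgetting, since a common reference point is a model retrained without the forget data.

```python
from typing import Dict, List, Set


def micro_f1(true_labels: List[Set[str]], pred_labels: List[Set[str]]) -> float:
    """Micro-averaged F1 over multi-label predictions (one label set per record)."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(len(t & p) for t, p in pairs)  # labels correctly predicted
    fp = sum(len(p - t) for t, p in pairs)  # predicted but not true
    fn = sum(len(t - p) for t, p in pairs)  # true but not predicted
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def unlearning_report(
    retain_true: List[Set[str]],
    retain_pred: List[Set[str]],
    forget_true: List[Set[str]],
    forget_pred: List[Set[str]],
) -> Dict[str, float]:
    """Hypothetical summary: utility on retained records vs. residual F1 on forgotten ones."""
    utility = micro_f1(retain_true, retain_pred)
    forget_f1 = micro_f1(forget_true, forget_pred)
    return {"utility": utility, "forget_f1": forget_f1, "gap": utility - forget_f1}


# Toy example with made-up diagnosis labels (not drawn from MIMIC-III).
retain_true = [{"sepsis", "pneumonia"}, {"heart_failure"}]
retain_pred = [{"sepsis", "pneumonia"}, {"heart_failure"}]
forget_true = [{"sepsis", "pneumonia"}]
forget_pred = [{"sepsis"}]

report = unlearning_report(retain_true, retain_pred, forget_true, forget_pred)
print(report)
```

Reporting the two scores side by side makes the trade-off described in the abstract visible: a method that drives the forget-set score down while also degrading the retain-set score has sacrificed utility to achieve forgetting.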