🤖 AI Summary
This work addresses the privacy risks inherent in model explanation methods, which, while enhancing transparency, can inadvertently leak user membership information. The authors propose a stronger explanation-aware membership inference attack and systematically evaluate 15 widely used explanation techniques, revealing that their default configurations exhibit a 74.9% higher privacy leakage risk compared to prior estimates. They further identify attribution sparsity and sensitivity as key sources of this vulnerability. To mitigate these risks, the study introduces a lightweight, model-agnostic hardening strategy comprising sensitivity-calibrated noise injection, attribution clipping, and masking. Experimental results demonstrate that the proposed approach reduces membership leakage by up to 95% while preserving explanation utility, with an average degradation of no more than 3.3%.
📝 Abstract
Machine learning (ML) explainability is central to algorithmic transparency in high-stakes settings such as predictive diagnostics and loan approval. However, these same domains require rigorous privacy guarantees, creating tension between interpretability and privacy. Although prior work has shown that explanation methods can leak membership information, practitioners still lack systematic guidance on selecting or deploying explanation techniques that balance transparency with privacy. We present DeepLeak, a system to audit and mitigate privacy risks in post-hoc explanation methods. DeepLeak advances the state of the art in three ways: (1) comprehensive leakage profiling: we develop a stronger explanation-aware membership inference attack (MIA) to quantify how much representative explanation methods leak membership information under default configurations; (2) lightweight hardening strategies: we introduce practical, model-agnostic mitigations, including sensitivity-calibrated noise, attribution clipping, and masking, that substantially reduce membership leakage while preserving explanation utility; and (3) root-cause analysis: through controlled experiments, we pinpoint algorithmic properties (e.g., attribution sparsity and sensitivity) that drive leakage. Evaluating 15 explanation techniques across four families on image benchmarks, DeepLeak shows that default settings can leak up to 74.9% more membership information than previously reported. Our mitigations cut leakage by up to 95% (minimum 46.5%) with only ≤3.3% utility loss on average. DeepLeak offers a systematic, reproducible path to safer explainability in privacy-sensitive ML.
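The hardening pipeline described above (noise injection, clipping, and masking applied to attribution maps) can be sketched as follows. This is a minimal illustrative implementation, not the authors' actual method: the function name, parameter values, and the use of the map's own standard deviation as a stand-in for "sensitivity calibration" are all assumptions for illustration.

```python
import numpy as np

def harden_attribution(attr, noise_scale=0.1, clip_quantile=0.99,
                       mask_fraction=0.5, rng=None):
    """Illustrative hardening of a feature-attribution map.

    All parameter defaults are hypothetical; the paper's actual
    calibration procedure is not specified here.
    """
    rng = np.random.default_rng() if rng is None else rng
    a = np.asarray(attr, dtype=float)

    # 1. Clip extreme attribution magnitudes to a per-map quantile,
    #    limiting the influence of outlier attributions.
    hi = np.quantile(np.abs(a), clip_quantile)
    a = np.clip(a, -hi, hi)

    # 2. Add Gaussian noise scaled to the map's spread (a simple proxy
    #    for "sensitivity-calibrated" noise, assumed for this sketch).
    a = a + rng.normal(0.0, noise_scale * (a.std() + 1e-12), size=a.shape)

    # 3. Mask (zero out) the lowest-magnitude fraction of attributions,
    #    keeping only the strongest signals in the explanation.
    thresh = np.quantile(np.abs(a), mask_fraction)
    return np.where(np.abs(a) >= thresh, a, 0.0)
```

A caller would apply this to the raw attribution map produced by any post-hoc explainer (e.g., a saliency or SHAP-style map) before releasing it, trading a small amount of attribution fidelity for reduced membership leakage.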