🤖 AI Summary
To address the dual challenges of PII/SPI exposure during large language model (LLM) training and cross-jurisdictional data compliance (e.g., GDPR, CCPA), this paper proposes a regulation-aware, adaptive PII/SPI mitigation framework. Embedded within a Governance, Risk, and Compliance (GRC) architecture, it uses a context-sensitive dynamic masking mechanism that adjusts anonymization strength per jurisdiction (for example, strict anonymization under GDPR versus pseudonymization under CCPA), aiming to balance regulatory compliance against model utility. Key components include an NLP-based PII/SPI detection module, a policy-driven masking engine, a multi-regulation alignment inference module, and an explainable feedback mechanism. In the reported benchmarks, the system reaches an F1 score of 0.95 for passport number identification, compared with 0.33 for Microsoft Presidio and 0.54 for Amazon Comprehend, and an average user trust score of 4.6/5 in human evaluations, which the authors take as evidence of its accuracy, transparency, and suitability for enterprise deployment.
📝 Abstract
Artificial Intelligence (AI) faces growing challenges from evolving data protection laws and enforcement practices worldwide. Regulations like GDPR and CCPA impose strict compliance requirements on Machine Learning (ML) models, especially concerning the use of personal data. These laws grant individuals rights such as data correction and deletion, complicating the training and deployment of Large Language Models (LLMs) that rely on extensive datasets. The fact that data is publicly available does not guarantee that it can lawfully be used for ML, which amplifies these challenges. This paper introduces an adaptive system for mitigating the risk of exposing Personally Identifiable Information (PII) and Sensitive Personal Information (SPI) in LLMs. It dynamically aligns with diverse regulatory frameworks and integrates into Governance, Risk, and Compliance (GRC) systems. The system uses advanced NLP techniques, context-aware analysis, and policy-driven masking to support regulatory compliance. Benchmarks highlight the system's effectiveness, with an F1 score of 0.95 for Passport Numbers, outperforming tools like Microsoft Presidio (0.33) and Amazon Comprehend (0.54). In human evaluations, the system achieved an average user trust score of 4.6/5, with participants acknowledging its accuracy and transparency. The system was observed to apply stricter anonymization under GDPR than under CCPA, which permits pseudonymization and user opt-outs. These results support the system as a scalable and robust solution for enterprise privacy compliance.
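To make the idea of policy-driven masking concrete, the following is a minimal sketch, not the paper's implementation. It assumes a hypothetical passport pattern (one uppercase letter followed by eight digits), a hypothetical `POLICIES` table, and a hypothetical `mask_passports` helper; the choice of a salted hash for pseudonymization is also an assumption made for illustration. Under the "GDPR" policy the value is replaced irreversibly, while under the "CCPA" policy it is replaced by a stable pseudonym.

```python
import hashlib
import re

# Hypothetical pattern: one uppercase letter followed by 8 digits.
# Real passport formats vary by country; this is for illustration only.
PASSPORT_RE = re.compile(r"\b[A-Z][0-9]{8}\b")

# Hypothetical policy table mapping a regulation to a masking strategy.
POLICIES = {
    "GDPR": "anonymize",     # stricter: drop the value entirely
    "CCPA": "pseudonymize",  # replace with a stable token
}


def mask_passports(text, regulation, salt="demo-salt"):
    """Mask passport-like tokens according to the regulation's policy."""
    strategy = POLICIES[regulation]

    def replace(match):
        if strategy == "anonymize":
            # Irreversible: no trace of the original value is kept.
            return "[PASSPORT]"
        # Pseudonymize: a salted hash gives the same token for the same
        # input, so records stay linkable without exposing the value.
        raw = (salt + match.group(0)).encode("utf-8")
        digest = hashlib.sha256(raw).hexdigest()[:8]
        return f"[PASSPORT_{digest}]"

    return PASSPORT_RE.sub(replace, text)


if __name__ == "__main__":
    sample = "Traveller holds passport X12345678."
    print(mask_passports(sample, "GDPR"))
    print(mask_passports(sample, "CCPA"))
```

A production system would pair this kind of policy lookup with a context-aware detector rather than a single regular expression, and a truncated salted hash alone would not necessarily meet a regulator's definition of pseudonymization.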