🤖 AI Summary
In the era of generative AI, conventional data protection paradigms fail to comprehensively address the full lifecycle—from training data and model weights to prompt inputs and generated outputs—resulting in ambiguous protection boundaries and suboptimal trade-offs between utility and control.
Method: We propose the first four-level data protection taxonomy for AI workflows: non-usability, privacy preservation, traceability, and deletability. This framework systematically identifies stage-specific protection requirements and integrates techniques including privacy-enhancing computation, differential privacy, model watermarking, machine unlearning, and data provenance.
Contribution/Results: We empirically expose critical protection gaps—particularly for model weights and generated content—under current technical and regulatory regimes, clarifying the fundamental tension between data utility and controllability. Our evaluation of representative solutions yields actionable guidance for developers, researchers, and regulators, advancing the operationalization of trustworthy AI governance.
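To make the privacy-preservation level concrete, the sketch below shows the classic Laplace mechanism for a differentially private counting query. This is an illustrative example only, not the paper's own implementation; the function names (`laplace_noise`, `dp_count`) and the counting-query setting are our assumptions.

```python
import random

def laplace_noise(scale):
    # Laplace(0, scale) sampled as the difference of two exponentials.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def dp_count(values, predicate, epsilon):
    """Release a count under epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one record
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon suffices for epsilon-DP.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon)

# Example: a noisy count over a toy dataset. Smaller epsilon means
# stronger privacy but a noisier (less useful) answer -- the
# utility/control trade-off the taxonomy highlights.
noisy = dp_count(range(100), lambda v: v < 40, epsilon=0.5)
```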
📝 Abstract
The (generative) artificial intelligence (AI) era has profoundly reshaped the meaning and value of data. No longer confined to static content, data now permeates every stage of the AI lifecycle, from the training samples that shape model parameters to the prompts and outputs that drive real-world model deployment. This shift renders traditional notions of data protection insufficient, while the boundaries of what needs safeguarding remain poorly defined. Failing to safeguard data in AI systems can inflict societal and individual harms, underscoring the urgent need to clearly delineate the scope of data protection and rigorously enforce it. In this perspective, we propose a four-level taxonomy, comprising non-usability, privacy preservation, traceability, and deletability, that captures the diverse protection needs arising in modern (generative) AI models and systems. Our framework offers a structured understanding of the trade-offs between data utility and control, spanning the entire AI pipeline, including training datasets, model weights, system prompts, and AI-generated content. We analyze representative technical approaches at each level and reveal regulatory blind spots that leave critical assets exposed. By offering a structured lens to align future AI technologies and governance with trustworthy data practices, we underscore the urgency of rethinking data protection for modern AI techniques and provide timely guidance for developers, researchers, and regulators alike.
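For the traceability level applied to AI-generated content, one widely studied approach is a statistical text watermark: each token's "green list" is derived from a hash of the previous token, and a detector checks whether green tokens occur more often than chance. The sketch below is a minimal illustration of that general idea, not the paper's method; all names (`green_list`, `watermark_score`) and parameters are our own assumptions.

```python
import hashlib
import math
import random

def green_list(prev_token, vocab, fraction=0.5):
    # Deterministically partition the vocabulary using a hash of the
    # previous token as the RNG seed; the first `fraction` is "green".
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    shuffled = sorted(vocab)  # sort first so the split is input-order independent
    rng.shuffle(shuffled)
    return set(shuffled[: int(len(shuffled) * fraction)])

def watermark_score(tokens, vocab, fraction=0.5):
    """z-score of green-token hits against the unwatermarked null.

    Unwatermarked text lands in the green list with probability
    `fraction`; watermarked generation biases toward green tokens,
    so a large positive z-score signals a watermark.
    """
    hits = sum(1 for prev, tok in zip(tokens, tokens[1:])
               if tok in green_list(prev, vocab, fraction))
    n = len(tokens) - 1
    return (hits - fraction * n) / math.sqrt(n * fraction * (1 - fraction))
```

Detection needs no access to the model itself, only the hashing scheme, which is what makes this family of techniques attractive for tracing generated content after release.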