🤖 AI Summary
The widespread adoption of open-source machine learning (ML) datasets and models has intensified risks including data poisoning, supply-chain attacks, and regulatory non-compliance. Method: This paper proposes the first verifiable, end-to-end ML provenance framework that integrates Trusted Execution Environments (TEEs) and transparent logging. Built on the SPDX and SLSA standards, it leverages Intel SGX, hash-chain-based immutable logging, and zero-knowledge proofs. Contribution/Results: The framework enables provable artifact authenticity, auditable end-to-end lineage, and jointly guaranteed confidentiality and integrity, without compromising intellectual property rights over data or models. Evaluated on two real-world ML pipelines, it achieves 100% metadata tampering detection and full verifiable traceability from training to deployment with negligible runtime overhead, demonstrating practical viability for secure, compliant ML operations.
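The summary names hash-chain-based immutable logging as the tamper-detection mechanism behind the 100% detection result. Below is a minimal Python sketch of the general technique, not the paper's actual implementation: the `HashChainLog` class and its entry fields are illustrative assumptions.

```python
import hashlib
import json

class HashChainLog:
    """Append-only log: each entry commits to the previous entry's hash,
    so altering any earlier record breaks every subsequent link."""

    def __init__(self):
        self.entries = []  # list of (payload_json, chained_hash) pairs

    def append(self, payload: dict) -> str:
        prev_hash = self.entries[-1][1] if self.entries else "0" * 64
        payload_json = json.dumps(payload, sort_keys=True)
        # Chain link: hash over the previous hash plus the serialized payload.
        link = hashlib.sha256((prev_hash + payload_json).encode()).hexdigest()
        self.entries.append((payload_json, link))
        return link

    def verify(self) -> bool:
        """Recompute the chain; any tampered payload yields a mismatch."""
        prev_hash = "0" * 64
        for payload_json, link in self.entries:
            expected = hashlib.sha256((prev_hash + payload_json).encode()).hexdigest()
            if expected != link:
                return False
            prev_hash = link
        return True

log = HashChainLog()
log.append({"step": "train", "model_digest": "sha256:abc..."})
log.append({"step": "deploy", "model_digest": "sha256:abc..."})
assert log.verify()

# Tampering with a recorded digest is caught on verification.
log.entries[0] = (log.entries[0][0].replace("abc", "evil"), log.entries[0][1])
assert not log.verify()
```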
📝 Abstract
The rapid adoption of open-source machine learning (ML) datasets and models exposes today's AI applications to critical risks such as data poisoning and supply-chain attacks across the ML lifecycle. With growing regulatory pressure to address these issues through greater transparency, ML model vendors face the challenge of balancing such requirements against the need to protect data confidentiality and intellectual property. We propose Atlas, a framework that enables fully attestable ML pipelines. Atlas leverages open specifications for data and software supply-chain provenance to collect verifiable records of model artifact authenticity and end-to-end lineage metadata. It combines trusted hardware and transparency logs to enhance metadata integrity, preserve data confidentiality, and limit unauthorized access during ML pipeline operations, from training through deployment. Our prototype implementation of Atlas integrates several open-source tools to build an ML lifecycle transparency system; we assess the practicality of Atlas through two case-study ML pipelines.
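To make the idea of a verifiable lineage record concrete, the sketch below hashes an artifact and embeds the digest in an SPDX-style provenance entry that a downstream consumer can recheck. The record fields, file names, and helper functions are assumptions for illustration; they do not reflect Atlas's actual schema.

```python
import hashlib
from pathlib import Path

def artifact_digest(path: str) -> str:
    """SHA-256 over the artifact bytes; the anchor for authenticity checks."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def make_provenance_record(artifact: str, parents: list[str]) -> dict:
    # Hypothetical SPDX-like lineage entry: the artifact's own digest plus
    # the digests of the upstream artifacts it was generated from
    # (e.g. training dataset -> model).
    return {
        "name": artifact,
        "checksum": {"algorithm": "SHA256", "value": artifact_digest(artifact)},
        "generatedFrom": [
            {"name": p,
             "checksum": {"algorithm": "SHA256", "value": artifact_digest(p)}}
            for p in parents
        ],
    }

def verify_record(record: dict) -> bool:
    """A consumer re-hashes the artifact and compares against the record."""
    return artifact_digest(record["name"]) == record["checksum"]["value"]

# Usage with placeholder files standing in for a dataset and a trained model.
Path("dataset.csv").write_text("x,y\n1,2\n")
Path("model.bin").write_bytes(b"weights")
rec = make_provenance_record("model.bin", parents=["dataset.csv"])
assert verify_record(rec)
```

In a transparency-log setting, each such record would be appended to an immutable log (as in the hash-chain sketch above), so both the artifact contents and the log's history can be independently verified.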