Trustworthy Provenance for Big Data Science: a Modular Architecture Leveraging Blockchain in Federated Settings

📅 2025-05-30
🤖 AI Summary
To address insufficient provenance integrity and weak cross-organizational interoperability in multi-institutional collaborative research, this paper proposes a federated provenance architecture built on a permissioned blockchain. The architecture is modular and domain-agnostic. It combines persistent identifiers (PIDs) for artifact traceability with a versioned provenance model that preserves the history of updates, and it supports decentralized interaction while relying on the blockchain for the integrity, immutability, and auditability of provenance data. Its distinguishing aspect is that the permissioned blockchain is embedded in the federated provenance workflow itself, so that provenance stays consistent and verifiable across organizational boundaries. The design aims to improve the transparency, accountability, and reproducibility of cross-institutional research data. Validation through a distributed prototype and a study of its performance in collaborative settings are described as ongoing work, so no evaluation results are reported yet.

📝 Abstract
Ensuring the trustworthiness and long-term verifiability of scientific data is a foundational challenge in the era of data-intensive, collaborative research. Provenance metadata plays a key role in this context, capturing the origin, transformation, and usage of research artifacts. However, existing solutions often fall short when applied to distributed, multi-institutional settings. This paper introduces a modular, domain-agnostic architecture for provenance tracking in federated environments, leveraging permissioned blockchain infrastructure to guarantee integrity, immutability, and auditability. The system supports decentralized interaction, persistent identifiers for artifact traceability, and a provenance versioning model that preserves the history of updates. Designed to interoperate with diverse scientific domains, the architecture promotes transparency, accountability, and reproducibility across organizational boundaries. Ongoing work focuses on validating the system through a distributed prototype and exploring its performance in collaborative settings.
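The abstract's combination of persistent identifiers and a provenance versioning model can be illustrated with a minimal sketch. It is not the authors' implementation: the names `ProvenanceVersion`, `ProvenanceHistory`, and `digest`, the record fields, and the example PID are all assumptions made for illustration. Each version of an artifact's provenance stores the hash of the previous version, so the update history cannot be rewritten without detection.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


def digest(payload: dict) -> str:
    """SHA-256 over a canonical (sorted-key) JSON encoding of the payload."""
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ProvenanceVersion:
    """One immutable version of an artifact's provenance (hypothetical schema)."""
    pid: str                  # persistent identifier of the artifact
    version: int              # 1-based version counter
    activity: str             # what produced or changed the artifact
    inputs: tuple             # PIDs of the artifacts that were used
    prev_hash: Optional[str]  # hash of the previous version, None for the first

    @property
    def hash(self) -> str:
        return digest({
            "pid": self.pid,
            "version": self.version,
            "activity": self.activity,
            "inputs": list(self.inputs),
            "prev_hash": self.prev_hash,
        })


class ProvenanceHistory:
    """Append-only, hash-chained list of provenance versions for one PID."""

    def __init__(self, pid: str):
        self.pid = pid
        self.versions = []

    def append(self, activity: str, inputs=()) -> ProvenanceVersion:
        prev = self.versions[-1].hash if self.versions else None
        record = ProvenanceVersion(
            self.pid, len(self.versions) + 1, activity, tuple(inputs), prev
        )
        self.versions.append(record)
        return record

    def verify(self) -> bool:
        """Check that version numbers are consecutive and every link matches."""
        prev = None
        for expected, record in enumerate(self.versions, start=1):
            if record.version != expected or record.prev_hash != prev:
                return False
            prev = record.hash
        return True


history = ProvenanceHistory("pid:example/dataset-1")  # made-up PID
history.append("ingest raw measurements")
history.append("normalize units", inputs=["pid:example/calibration-7"])
print(history.verify())
```

A chain like this only detects changes to earlier versions. A change to the newest version goes unnoticed unless its hash is also recorded somewhere the editing party cannot alter, which is the role the paper assigns to the permissioned blockchain.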

Problem

Research questions and friction points this paper is trying to address.

Ensuring trustworthiness in collaborative big data science
Tracking provenance in distributed multi-institutional research settings
Guaranteeing data integrity using blockchain in federated environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular architecture for federated provenance tracking
Permissioned blockchain ensures integrity and auditability
Decentralized interaction with persistent artifact identifiers
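
How a permissioned ledger could support integrity checks between institutions can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's design: `AnchorLedger` is a hypothetical in-memory stand-in for a real permissioned blockchain, and `record_digest`, `verify_record`, the organization names, and the record layout are invented for the example. Institutions keep full provenance records locally, write only a digest to the shared ledger, and any member can later check a received record against that digest.

```python
import hashlib
import json


def record_digest(record: dict) -> str:
    """SHA-256 over a canonical (sorted-key) JSON encoding of a record."""
    encoded = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


class AnchorLedger:
    """In-memory stand-in for a permissioned ledger (hypothetical).

    Only member organizations may write, and an anchored entry is never
    overwritten, mimicking append-only ledger semantics.
    """

    def __init__(self, members):
        self._members = frozenset(members)
        self._anchors = {}  # (pid, version) -> (organization, digest)

    def anchor(self, org: str, pid: str, version: int, digest: str) -> None:
        if org not in self._members:
            raise PermissionError(f"{org} is not a member of this ledger")
        key = (pid, version)
        if key in self._anchors:
            raise ValueError(f"{pid} v{version} is already anchored")
        self._anchors[key] = (org, digest)

    def lookup(self, pid: str, version: int):
        """Return (organization, digest), or None if nothing is anchored."""
        return self._anchors.get((pid, version))


def verify_record(ledger: AnchorLedger, record: dict) -> bool:
    """True if the record's digest matches what was anchored for its PID/version."""
    entry = ledger.lookup(record["pid"], record["version"])
    return entry is not None and entry[1] == record_digest(record)


ledger = AnchorLedger(members={"org-a", "org-b"})
record = {
    "pid": "pid:example/dataset-1",  # made-up PID
    "version": 1,
    "activity": "ingest raw measurements",
}

# org-a anchors the digest; the record itself stays in org-a's own storage.
ledger.anchor("org-a", record["pid"], record["version"], record_digest(record))

# org-b receives a copy of the record and checks it against the ledger.
print(verify_record(ledger, record))

tampered = dict(record, activity="something else")
print(verify_record(ledger, tampered))
```

Anchoring digests rather than full records keeps potentially sensitive metadata off the shared ledger, at the cost of requiring each institution to keep its own records available for later verification.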