🤖 AI Summary
In scientific provenance, the disconnection between workflow and data provenance, coupled with inconsistent dimensions and granularity, undermines trustworthiness and reproducibility. To address this, we propose a unified framework that systematically integrates workflow and data provenance. Our approach introduces a joint dimension-granularity representation model, formally defines the W7+1 provenance problem, and enables domain-adaptable, end-to-end, fine-grained provenance modeling across the full research lifecycle. Demonstrated on a biomedical research use case, the framework traces results back through the analytical steps to the raw data, supporting transparency, verifiability, and cross-study reproducibility. The core innovation is the orthogonal co-modeling and unified resolution of workflow and data provenance along both dimensional axes (e.g., who, what, when) and granularity levels (e.g., task-level, operation-level, byte-level).
📝 Abstract
Provenance information is essential for the traceability of scientific studies and experiments and is thus crucial for ensuring the credibility and reproducibility of research findings. This paper discusses a comprehensive provenance framework that combines two types of provenance, (1) workflow provenance and (2) data provenance, together with their dimensions and granularity, and thereby enables answering the W7+1 provenance questions. We demonstrate its applicability with a biomedical research use case that can be easily transferred to other scientific fields. Integrating these concepts into a unified framework supports the credibility and reproducibility of research findings.