🤖 AI Summary
This study examines a key assumption behind industrial-scale rebuilding of open-source packages for software supply chain security: that independent rebuilds use the same source code as the originally published package. The authors compare the sources released with packages on Maven Central against the sources associated with independently built packages from Google's Assured Open Source and Oracle's Build-from-Source projects, and study the non-equivalent sources found across 85 releases of 28 popular Java packages. Their root-cause analysis finds that the main cause of non-equivalence is build extensions that generate code at build time, which are difficult to reproduce. The authors suggest strategies to address this issue.
📝 Abstract
Rebuilding packages from open source is a common practice to improve the security of software supply chains, and is now done at an industrial scale. The basic principle is to acquire the source code used to build a package published in a repository such as Maven Central (for Java), rebuild the package independently with hardened security, and publish it in an alternative repository. In this paper we test the assumption that the same source code is being used by those alternative builds. To study this, we compare the sources released with packages on Maven Central against the sources associated with independently built packages from Google's Assured Open Source and Oracle's Build-from-Source projects. We study non-equivalent sources for alternative builds of 85 releases of 28 popular packages. We investigate the causes of non-equivalence, and find that the main cause is build extensions that generate code at build time, which are difficult to reproduce. We suggest strategies to address this issue.
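The comparison described in the abstract can be illustrated with a minimal sketch. It assumes that the sources for each build are available as zip-format archives (as `-sources.jar` files are) and that two files count as equivalent only if their bytes are identical. Both are assumptions made for illustration: the paper's notion of equivalence may differ, and the function names `source_digests`, `compare_sources`, and `make_archive`, as well as the file paths, are hypothetical.

```python
import hashlib
import io
import zipfile


def source_digests(jar_bytes):
    """Map each file path in a sources archive to the SHA-256 of its content."""
    digests = {}
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            digests[info.filename] = hashlib.sha256(archive.read(info)).hexdigest()
    return digests


def compare_sources(reference_jar, alternative_jar):
    """Report paths that are missing on one side or differ byte-for-byte.

    Assumption: byte equality stands in for source equivalence here.
    """
    ref = source_digests(reference_jar)
    alt = source_digests(alternative_jar)
    return {
        "only_in_reference": sorted(ref.keys() - alt.keys()),
        "only_in_alternative": sorted(alt.keys() - ref.keys()),
        "changed": sorted(p for p in ref.keys() & alt.keys() if ref[p] != alt[p]),
    }


def make_archive(files):
    """Build an in-memory zip from a {path: text} mapping (demo helper only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for path, text in files.items():
            archive.writestr(path, text)
    return buffer.getvalue()


# Hypothetical example: the alternative build ships a differently generated
# file plus one extra file that is absent from the reference sources.
reference = make_archive({
    "com/example/A.java": "class A {}",
    "com/example/Generated.java": "class Generated { int v = 1; }",
})
alternative = make_archive({
    "com/example/A.java": "class A {}",
    "com/example/Generated.java": "class Generated { int v = 2; }",
    "com/example/Extra.java": "class Extra {}",
})
report = compare_sources(reference, alternative)
print(report)
```

A report with a non-empty `changed` or `only_in_*` list only flags candidates for inspection; deciding whether a difference comes from build-time code generation would still require looking at the build configuration.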