🤖 AI Summary
This study presents the first systematic empirical investigation of how developers publish Software Bills of Materials (SBOMs) on Maven Central. To address the lack of large-scale, registry-level SBOM analysis, the authors propose a registry-oriented method for SBOM discovery and dependency graph augmentation. The method builds on the Goblin framework, which provides a Maven Central dependency graph, and uses its Weaver component for automated parsing of SBOMs (SPDX/CycloneDX), graph traversal-based sampling, and multi-source data fusion. From a 10% sample of release nodes in the dependency graph, they collected 14,071 SBOMs covering 7,290 package releases, establishing the first publicly available Maven SBOM dataset. Results reveal critically low SBOM adoption rates and severe format fragmentation. Key contributions include: (1) the first empirically grounded, multi-source SBOM dataset; (2) a scalable, registry-level SBOM discovery and graph-augmentation framework; and (3) evidence-based insights for enhancing transparency in open-source software supply chains.
📝 Abstract
Software Bills of Materials (SBOMs) are essential to ensure the transparency and integrity of the software supply chain. A growing body of work investigates the accuracy of SBOM generation tools and the challenges of producing complete SBOMs. Yet little is known about how developers distribute SBOMs. In this work, we mine SBOMs from Maven Central to assess the extent to which developers publish SBOMs along with their artifacts. We build our work on top of the Goblin framework, which consists of a Maven Central dependency graph and a Weaver that allows augmenting the dependency graph with additional data. For this study, we select a sample of 10% of release nodes from the Maven Central dependency graph and collect 14,071 SBOMs from 7,290 package releases. We then augment the Maven Central dependency graph with the collected SBOMs. We present our methodology to mine SBOMs, as well as novel insights about SBOM publication. Our dataset is the first set of SBOMs collected from a package registry. We make it available as a standalone dataset, which can be used for future research about SBOMs and package distribution.
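As a rough illustration of the mining step described in the abstract, the sketch below samples 10% of a list of release coordinates and builds candidate SBOM URLs for each one using the standard Maven Central repository layout. It does not use Goblin or its Weaver. The function names, the fixed seed, and the list of SBOM file suffixes are assumptions made for this example (the suffixes reflect common CycloneDX and SPDX plugin naming), and the paper's actual discovery rules may differ.

```python
import random

MAVEN_CENTRAL = "https://repo1.maven.org/maven2"

# Assumed file-name suffixes for published SBOMs; these are illustrative
# and are not taken from the paper.
SBOM_SUFFIXES = ("-cyclonedx.json", "-cyclonedx.xml", ".spdx.json")


def sample_releases(releases, fraction=0.10, seed=0):
    """Return a reproducible random sample of release coordinates."""
    rng = random.Random(seed)
    k = round(len(releases) * fraction)
    return rng.sample(releases, k)


def candidate_sbom_urls(gav):
    """Build candidate SBOM URLs for a 'groupId:artifactId:version' string."""
    group_id, artifact_id, version = gav.split(":")
    # Maven Central stores artifacts under groupId (dots become slashes),
    # then artifactId, then version.
    base = f"{MAVEN_CENTRAL}/{group_id.replace('.', '/')}/{artifact_id}/{version}"
    return [f"{base}/{artifact_id}-{version}{suffix}" for suffix in SBOM_SUFFIXES]


if __name__ == "__main__":
    # Hypothetical coordinates, used only to show the URL shape.
    releases = [f"org.example:demo:1.0.{i}" for i in range(100)]
    for gav in sample_releases(releases)[:3]:
        for url in candidate_sbom_urls(gav):
            print(url)
```

A real crawler would then issue an HTTP request for each candidate URL and keep the ones that resolve; that step is omitted here to keep the sketch free of network access.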