Wild SBOMs: a Large-scale Dataset of Software Bills of Materials from Public Code

📅 2025-03-19
🤖 AI Summary
Problem: Empirical studies of Software Bill of Materials (SBOM) adoption in real-world settings remain scarce, which hinders systematic analysis of SBOM quality, compliance, and evolution. Method: We construct a large-scale, openly available dataset of over 78,000 unique SBOM files, deduplicated from those found in over 94 million public code repositories. Each file is annotated with the standard and format used, a quality score generated by the tool sbomqs, the number of revisions, filenames, and provenance information, which supports structured analysis of “wild” SBOMs across standards, formats, quality, and revision history. Contribution/Results: The dataset addresses the lack of data for large-scale studies of SBOM practices, including FOSS license management and component security. It is publicly released, together with suggestions and examples of research it could support on assessing and improving real-world SBOM practices.

📝 Abstract
Developers gain productivity by reusing readily available Free and Open Source Software (FOSS) components. Such practices also bring difficulties, such as managing licensing, components, and related security. One approach to handling those difficulties is to use Software Bills of Materials (SBOMs). While there have been studies on the readiness of practitioners to embrace SBOMs and on the SBOM tools ecosystem, a large-scale study of SBOM practices based on SBOM files produced in the wild is still lacking. A starting point for such a study is a large dataset of SBOM files found in the wild. We introduce such a dataset, consisting of over 78 thousand unique SBOM files, deduplicated from those found in over 94 million repositories. We include metadata that contains the standard and format used, a quality score generated by the tool sbomqs, the number of revisions, filenames, and provenance information. Finally, we give suggestions and examples of research that could bring new insights into assessing and improving real-world SBOM practices.
Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale study on SBOM practices
Challenges in managing FOSS components and security
Need for dataset to assess and improve SBOM usage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset of 78k unique SBOM files
Deduplicated from 94 million public repositories
Includes metadata, quality scores, and provenance
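
The deduplication and standard-identification steps named above can be illustrated with a minimal Python sketch. It is not the authors' pipeline, which this summary does not describe. It assumes that "unique" means byte-identical content (approximated here with a SHA-256 digest) and it only inspects JSON documents, using the top-level `bomFormat` key of CycloneDX JSON and the `spdxVersion` key of SPDX JSON. The function names `detect_standard` and `deduplicate` and the record layout are hypothetical.

```python
import hashlib
import json


def detect_standard(text: str) -> str:
    """Guess the SBOM standard of a JSON document from its top-level keys.

    Assumption: only JSON serializations are handled. Other serializations
    (for example CycloneDX XML or SPDX tag-value) are reported as "unknown".
    """
    try:
        doc = json.loads(text)
    except json.JSONDecodeError:
        return "unknown"
    if not isinstance(doc, dict):
        return "unknown"
    if doc.get("bomFormat") == "CycloneDX":
        return "CycloneDX"
    if "spdxVersion" in doc:
        return "SPDX"
    return "unknown"


def deduplicate(files):
    """Group (repo, path, content_bytes) triples by content digest.

    Returns a dict mapping each SHA-256 hex digest to one record holding the
    detected standard and every (repo, path) where that content was seen,
    which acts as simple provenance information.
    """
    unique = {}
    for repo, path, content in files:
        digest = hashlib.sha256(content).hexdigest()
        if digest not in unique:
            text = content.decode("utf-8", errors="replace")
            unique[digest] = {
                "standard": detect_standard(text),
                "occurrences": [],
            }
        unique[digest]["occurrences"].append((repo, path))
    return unique


if __name__ == "__main__":
    # Hypothetical inputs: two identical CycloneDX files and one SPDX file.
    cdx = b'{"bomFormat": "CycloneDX", "specVersion": "1.5"}'
    spdx = b'{"spdxVersion": "SPDX-2.3", "SPDXID": "SPDXRef-DOCUMENT"}'
    sample = [
        ("repo-a", "bom.json", cdx),
        ("repo-b", "sbom/bom.json", cdx),
        ("repo-c", "app.spdx.json", spdx),
    ]
    for digest, record in deduplicate(sample).items():
        print(digest[:12], record["standard"], len(record["occurrences"]))
```

Keeping the list of occurrences per digest, rather than discarding duplicates, preserves where each unique file was found, which is the kind of provenance metadata the dataset is said to include.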