🤖 AI Summary
Mapping NVD entries to vulnerability-fixing commits (VFCs) is challenging because reference links are sparse and heterogeneous. This paper first quantifies the gap in mapping effectiveness between Git and non-Git links, then proposes a multi-source mapping framework that integrates NVD references, six major security databases, and GitHub repositories, using rule-based parsing, URL normalization, cross-source deduplication, and precision validation. Out of 235,341 NVD records, the combined approach maps 26,710 unique records (11.3% coverage) across 7,634 projects, with per-source precision of up to 88.4%. The remaining 88.7% of NVD records stay unmapped, which exposes a fundamental bottleneck in automated vulnerability-to-commit linkage when Git links are absent. The study delivers a reproducible methodology and an empirical benchmark for vulnerability tracking and patch analysis.
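The parsing, normalization, and deduplication steps named in the summary can be illustrated with a minimal sketch. The regular expression, the function names `normalize_commit_url` and `merge_sources`, and the restriction to GitHub commit links are assumptions made for illustration; the paper's actual rules are not given here and are likely broader.

```python
import re
from urllib.parse import urlsplit

# Hypothetical rule: GitHub commit links of the form
#   /<owner>/<repo>/commit/<sha>  or  /<owner>/<repo>/pull/<n>/commits/<sha>
_COMMIT_PATH = re.compile(
    r"^/(?P<owner>[^/]+)/(?P<repo>[^/]+)/(?:pull/\d+/)?commits?/(?P<sha>[0-9a-fA-F]{7,40})"
)

def normalize_commit_url(url):
    """Return ("owner/repo", sha) in lowercase, or None if url is not a GitHub commit link."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host not in ("github.com", "www.github.com"):
        return None
    match = _COMMIT_PATH.match(parts.path)
    if match is None:
        return None
    repo = match.group("repo")
    if repo.endswith(".git"):
        repo = repo[:-4]
    return (f"{match.group('owner')}/{repo}".lower(), match.group("sha").lower())

def merge_sources(*sources):
    """Union the normalized commits per record across several {record_id: [urls]} mappings."""
    merged = {}
    for source in sources:
        for record_id, urls in source.items():
            for url in urls:
                key = normalize_commit_url(url)
                if key is not None:
                    merged.setdefault(record_id, set()).add(key)
    return merged
```

Because every link is reduced to a lowercase `(repository, sha)` pair before merging, the same commit reported by two sources with different URL spellings counts once. Abbreviated and full hashes of the same commit are still treated as distinct in this sketch.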
📝 Abstract
Mapping National Vulnerability Database (NVD) records to vulnerability-fixing commits (VFCs) is crucial for vulnerability analysis but challenging because explicit links are sparse in NVD references. This study explores the feasibility of this mapping through an empirical approach. Manual analysis of NVD references showed that Git references succeed in over 86% of cases, while non-Git references succeed in under 14%. Building on these findings, we developed an automated pipeline that extracts 31,942 VFCs from 20,360 NVD records (8.7% of 235,341) with 87% precision, mainly from Git references. To fill gaps, we mined six external security databases, yielding 29,254 VFCs for 18,985 records (8.1%) at 88.4% precision, and GitHub repositories, adding 3,686 VFCs for 2,795 records (1.2%) at 73% precision. Combining these sources, we mapped 26,710 unique records (11.3% coverage) from 7,634 projects, with overlap between NVD and the external databases plus unique contributions from GitHub. Despite the success with Git references, 88.7% of records remain unmapped, highlighting how difficult the mapping is without Git links. This study offers insights for enhancing vulnerability datasets and guiding future automated security research.
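The coverage figures in the abstract follow directly from the reported record counts, and the short check below reproduces them. The variable names are illustrative; the numbers are taken from the abstract. It also shows that the three per-source counts add up to more than the number of unique records, which is the overlap between sources that the abstract mentions.

```python
TOTAL_NVD_RECORDS = 235_341

# Records mapped by each source, as reported in the abstract.
mapped_by_source = {
    "nvd_references": 20_360,
    "external_databases": 18_985,
    "github_repositories": 2_795,
}
UNIQUE_MAPPED = 26_710

def pct(part, whole=TOTAL_NVD_RECORDS):
    """Share of `whole` as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)

coverage_by_source = {name: pct(count) for name, count in mapped_by_source.items()}
combined_coverage = pct(UNIQUE_MAPPED)
unmapped_share = pct(TOTAL_NVD_RECORDS - UNIQUE_MAPPED)

# Per-source counts sum to more than the unique total, so some records
# are mapped by more than one source.
duplicate_mappings = sum(mapped_by_source.values()) - UNIQUE_MAPPED

print(coverage_by_source, combined_coverage, unmapped_share, duplicate_mappings)
```

The per-source shares come out to 8.7%, 8.1%, and 1.2%, the combined coverage to 11.3%, and the unmapped share to 88.7%, matching the abstract. The surplus of 15,430 counts mappings that more than one source found for the same record.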