PatchSeeker: Mapping NVD Records to their Vulnerability-fixing Commits with LLM Generated Commits and Embeddings

📅 2025-09-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the challenge of automatically and precisely mapping National Vulnerability Database (NVD) entries to their corresponding vulnerability-fixing commits (VFCs). Conventional approaches match NVD descriptions against commit messages that are often too short or uninformative to carry the needed semantics, which degrades matching quality. To mitigate this, the authors propose a large language model (LLM)-based semantic enhancement framework. The method uses LLMs to generate vulnerability-enriched commit summaries that serve as a semantic bridge between natural-language NVD descriptions and code changes, and it uses text embeddings for fine-grained semantic alignment. On the benchmark dataset, the approach achieves 59.3% higher Mean Reciprocal Rank (MRR) and 27.9% higher Recall@10 than the state-of-the-art baseline, Prospector. It also generalizes well to recent CVE data. The core contribution is the first systematic integration of LLM-driven semantic summarization into the NVD-to-VFC mapping task, which improves both accuracy and robustness.

📝 Abstract

Software vulnerabilities pose serious risks to modern software ecosystems. While the National Vulnerability Database (NVD) is the authoritative source for cataloging these vulnerabilities, it often lacks explicit links to the corresponding Vulnerability-Fixing Commits (VFCs). VFCs encode precise code changes, enabling vulnerability localization, patch analysis, and dataset construction. Automatically mapping NVD records to their true VFCs is therefore critical. Existing approaches are limited because they rely on sparse, often noisy commit messages and fail to capture the deep semantics of the vulnerability descriptions. To address this gap, we introduce PatchSeeker, a novel method that leverages large language models to create rich semantic links between vulnerability descriptions and their VFCs. PatchSeeker generates embeddings from NVD descriptions and enhances commit messages by synthesizing detailed summaries for those that are short or uninformative. These generated messages act as a semantic bridge, closing the information gap between natural-language reports and low-level code changes. On the benchmark dataset, PatchSeeker achieves 59.3% higher MRR and 27.9% higher Recall@10 than the best-performing baseline, Prospector. An extended evaluation on recent CVEs further confirms PatchSeeker's effectiveness. An ablation study shows that both the commit message generation method and the choice of backbone LLM contribute positively to PatchSeeker's performance. We also discuss limitations and open challenges to guide future work.
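The two metrics reported in the abstract can be illustrated with a short sketch. It assumes the common definitions: MRR averages the reciprocal rank of the first true VFC per CVE, and Recall@k counts a CVE as a hit when at least one true VFC appears in its top k candidates. The paper's exact definitions may differ, and the function names here are illustrative only.

```python
from typing import Dict, List, Set


def reciprocal_rank(ranked: List[str], true_vfcs: Set[str]) -> float:
    """Return 1/position of the first true VFC in the ranking, or 0.0 if none appears."""
    for position, commit in enumerate(ranked, start=1):
        if commit in true_vfcs:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    rankings: Dict[str, List[str]], ground_truth: Dict[str, Set[str]]
) -> float:
    """Average the reciprocal rank over all CVEs that have a ranking."""
    if not rankings:
        return 0.0
    total = sum(
        reciprocal_rank(ranked, ground_truth.get(cve, set()))
        for cve, ranked in rankings.items()
    )
    return total / len(rankings)


def recall_at_k(
    rankings: Dict[str, List[str]], ground_truth: Dict[str, Set[str]], k: int = 10
) -> float:
    """Fraction of CVEs with at least one true VFC among the top-k candidates (assumed definition)."""
    if not rankings:
        return 0.0
    hits = sum(
        1
        for cve, ranked in rankings.items()
        if set(ranked[:k]) & ground_truth.get(cve, set())
    )
    return hits / len(rankings)
```

Under these definitions, a method that ranks the true fix second for one CVE and misses it entirely for another scores an MRR of 0.25, which shows how strongly MRR rewards placing the correct commit near the top.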

Problem

Research questions and friction points this paper is trying to address.

Mapping NVD records to vulnerability-fixing commits automatically

Bridging information gap between vulnerability descriptions and code changes

Overcoming limitations of sparse noisy commit messages

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-generated commit messages bridge semantic gap

Embeddings from NVD descriptions enhance semantic matching

Synthesized summaries replace short or uninformative commit messages with information-rich commit representations
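The ideas listed above can be sketched as a small ranking routine. This is not the authors' implementation: `toy_embed` is a bag-of-words stand-in for a learned text-embedding model, `summarize` is a caller-supplied placeholder for the LLM that writes a commit summary from a diff, and `MIN_MESSAGE_WORDS` is a hypothetical threshold for deciding that a message is "short or uninformative".

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple

MIN_MESSAGE_WORDS = 8  # hypothetical cutoff, not taken from the paper


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: token counts. A real system would use a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def enrich_message(message: str, diff: str, summarize: Callable[[str], str]) -> str:
    """Append a generated summary only when the original message is short."""
    if len(message.split()) >= MIN_MESSAGE_WORDS:
        return message
    return f"{message}\n{summarize(diff)}"


def rank_commits(
    nvd_description: str,
    commits: Dict[str, Tuple[str, str]],  # commit id -> (message, diff)
    summarize: Callable[[str], str],
) -> List[Tuple[str, float]]:
    """Rank candidate commits by similarity between the NVD text and the enriched message."""
    query = toy_embed(nvd_description)
    scored = [
        (commit_id, cosine(query, toy_embed(enrich_message(message, diff, summarize))))
        for commit_id, (message, diff) in commits.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With a terse message such as "fix bug", the similarity score depends almost entirely on the generated summary, which is the sense in which the summary acts as a bridge between the vulnerability description and the code change.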

🔎 Similar Papers

APPATCH: Automated Adaptive Prompting Large Language Models for Real-World Software Vulnerability Patching