VulCoCo: A Simple Yet Effective Method for Detecting Vulnerable Code Clones

📅 2025-07-22
🤖 AI Summary
Code reuse can propagate known vulnerabilities across projects when developers copy code fragments that preserve the vulnerable logic, known as vulnerable code clones (VCCs). Existing detection tools either rely on syntactic similarity or produce coarse-grained vulnerability predictions without clear explanations. To address this, we propose VulCoCo, a lightweight two-stage approach: first, embedding-based retrieval of candidate functions that are syntactically or semantically similar to known vulnerable functions; second, validation by a large language model (LLM) of whether each candidate retains the vulnerability. Because reproducible VCC benchmarks are lacking, we construct a synthetic benchmark that spans various clone types. On this benchmark, VulCoCo outperforms prior state-of-the-art methods in Precision@k and mean average precision (MAP). In real-world deployment, we submitted 400 pull requests to 284 open-source projects; 75 were merged, and 15 resulted in newly published CVEs.

📝 Abstract
Code reuse is common in modern software development, but it can also spread vulnerabilities when developers unknowingly copy risky code. The code fragments that preserve the logic of known vulnerabilities are known as vulnerable code clones (VCCs). Detecting those VCCs is a critical but challenging task. Existing VCC detection tools often rely on syntactic similarity or produce coarse vulnerability predictions without clear explanations, limiting their practical utility. In this paper, we propose VulCoCo, a lightweight and scalable approach that combines embedding-based retrieval with large language model (LLM) validation. Starting from a set of known vulnerable functions, we retrieve syntactically or semantically similar candidate functions from a large corpus and use an LLM to assess whether the candidates retain the vulnerability. Given that there is a lack of reproducible vulnerable code clone benchmarks, we first construct a synthetic benchmark that spans various clone types. Our experiments on the benchmark show that VulCoCo outperforms prior state-of-the-art methods in terms of Precision@k and mean average precision (MAP). In addition, we also demonstrate VulCoCo's effectiveness in real-world projects by submitting 400 pull requests (PRs) to 284 open-source projects. Among them, 75 PRs were merged, and 15 resulted in newly published CVEs. We also provide insights to inspire future work to further improve the precision of vulnerable code clone detection.
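The retrieve-then-validate idea described in the abstract can be sketched as follows. This is a minimal illustration, not VulCoCo's implementation: the bag-of-tokens `embed` function stands in for a learned code embedding, the `validate` callback stands in for an LLM call, and the names `retrieve`, `detect`, `top_k`, and `threshold` are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable


def embed(code: str) -> Counter:
    """Toy stand-in for a code embedding: counts of identifier-like tokens."""
    return Counter(re.findall(r"[A-Za-z_]\w*", code))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(seed: str, corpus: dict[str, str],
             top_k: int = 5, threshold: float = 0.2) -> list[tuple[str, float]]:
    """Stage 1: rank corpus functions by similarity to a known vulnerable seed."""
    seed_vec = embed(seed)
    scored = [(name, cosine(seed_vec, embed(body))) for name, body in corpus.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:top_k]


def detect(seed: str, corpus: dict[str, str],
           validate: Callable[[str, str], bool],
           top_k: int = 5, threshold: float = 0.2) -> list[str]:
    """Stage 2: keep only candidates that the validator says retain the flaw."""
    candidates = retrieve(seed, corpus, top_k, threshold)
    return [name for name, _ in candidates if validate(seed, corpus[name])]


# Hypothetical data: one known vulnerable function and a tiny corpus.
seed = "void copy(char *dst, char *src) { strcpy(dst, src); }"
corpus = {
    "clone_unfixed": "void dup(char *out, char *in) { strcpy(out, in); }",
    "clone_fixed": "void dup(char *out, char *in, size_t n) { strncpy(out, in, n); }",
    "unrelated": "int add(int a, int b) { return a + b; }",
}


def stub_validator(seed_code: str, candidate: str) -> bool:
    """Placeholder for the LLM check; a real system would prompt a model here."""
    return "strcpy(" in candidate


flagged = detect(seed, corpus, stub_validator)
```

In this toy run, stage 1 retrieves both `clone_unfixed` and `clone_fixed` because they resemble the seed, and stage 2 discards the patched variant. That separation is the point of the design: retrieval is cheap and recall-oriented, while the more expensive validation step is applied only to the shortlisted candidates.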
Problem

Research questions and friction points this paper is trying to address.

Detect vulnerable code clones that spread through code reuse
Improve on existing VCC detection tools, which rely on syntactic similarity or give coarse, unexplained predictions
Verify whether retrieved candidate functions actually retain the known vulnerability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines embedding-based retrieval with LLM validation
Constructs synthetic benchmark for various clone types
Outperforms prior methods in Precision@k and MAP
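For readers unfamiliar with the two ranking metrics named above, here is a short sketch using their common textbook definitions. The exact variants used in the paper (for example, how average precision is normalized) are not stated in this summary, so the normalization below is an assumption, and the function names and sample data are hypothetical.

```python
def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k ranked items that are truly relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for item in ranked[:k] if item in relevant) / k


def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """Mean of precision values taken at each rank where a relevant item appears.

    Assumption: normalized by the total number of relevant items.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(queries: list[tuple[list[str], set[str]]]) -> float:
    """Average of per-query average precision; each query is (ranking, relevant set)."""
    if not queries:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


# Hypothetical rankings for two seed vulnerabilities.
queries = [
    (["f1", "f2", "f3", "f4"], {"f1", "f3"}),  # true clones at ranks 1 and 3
    (["g1", "g2"], {"g2"}),                     # true clone at rank 2
]
p_at_2 = precision_at_k(queries[0][0], queries[0][1], k=2)
map_score = mean_average_precision(queries)
```

Here the first query has average precision (1/1 + 2/3) / 2 and the second has 1/2, so both metrics reward placing true vulnerable clones near the top of the candidate list.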
Authors
Tan Bui
Singapore Management University, Singapore
Yan Naing Tun
Singapore Management University, Singapore
Thanh Phuc Nguyen
Singapore Management University, Singapore
Yindu Su
Xiaohongshu Inc.
Ferdian Thung
Research Scientist, School of Information Systems, Singapore Management University
Software Engineering, Data Mining
Yikun Li
Postdoctoral Researcher
Artificial Intelligence, Software Engineering, Cyber Security
Han Wei Ang
GovTech, Singapore
Yide Yin
GovTech, Singapore
Frank Liauw
Lead Cybersecurity Engineer, Government Technology Agency Singapore
Lwin Khin Shar
Singapore Management University, Singapore
Eng Lieh Ouh
Singapore Management University, Singapore
Ting Zhang
Singapore Management University, Singapore
David Lo
Singapore Management University, Singapore