On the Evaluation of Large Language Models in Multilingual Vulnerability Repair

📅 2025-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing deep learning–based vulnerability repair methods are largely confined to a single language (e.g., C/C++), and the cross-lingual generalization capability of large language models (LLMs) remains insufficiently validated. Method: This work conducts the first large-scale empirical study of multilingual vulnerability repair across seven programming languages, combining instruction tuning, few-shot prompting, and state-of-the-art LLMs, including GPT-4o. Contribution/Results: GPT-4o demonstrates strong cross-lingual generalization, achieving its highest repair success rate on Go, and significantly outperforms prior approaches on high-severity and semantically distinctive vulnerabilities. The study establishes the first benchmarked empirical foundation for multilingual automated vulnerability repair and highlights the critical role of LLMs' semantic understanding in software security.

📝 Abstract
Various deep learning-based approaches built on pre-trained language models have been proposed for automatically repairing software vulnerabilities. However, these approaches are limited to a single programming language (C/C++). Recent advances in large language models (LLMs) offer language-agnostic capabilities and strong semantic understanding, showing potential to overcome this multilingual limitation. Although some work has begun to explore LLMs' repair performance, their effectiveness remains unsatisfactory. To address these limitations, we conducted a large-scale empirical study investigating the performance of automated vulnerability repair approaches and state-of-the-art LLMs across seven programming languages. Results show that GPT-4o, instruction-tuned with few-shot prompting, performs competitively against the leading approach, VulMaster. Additionally, the LLM-based approach shows superior performance in repairing unique vulnerabilities and is more likely to repair the most dangerous ones. Instruction-tuned GPT-4o demonstrates strong generalization on vulnerabilities in previously unseen languages, outperforming existing approaches. Analysis shows that Go consistently achieves the highest effectiveness across all model types, while C/C++ performs the worst. Based on these findings, we discuss the promise of LLMs for multilingual vulnerability repair and the reasons behind their failed cases. This work takes the first look at repair approaches and LLMs across multiple languages, highlighting the promising future of adopting LLMs for multilingual vulnerability repair.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for multilingual vulnerability repair across languages
Assessing GPT-4o's performance against VulMaster in vulnerability fixes
Analyzing LLM effectiveness and failure cases in diverse programming languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Instruction-tuned GPT-4o for multilingual repair
Few-shot prompting enhances vulnerability fixing
LLMs outperform in dangerous vulnerability repair
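The paper does not publish its exact prompt templates, so the following is only an illustrative sketch of how a few-shot vulnerability-repair prompt of the kind described above might be assembled. All example snippets, CWE labels, and the `build_repair_prompt` helper are hypothetical, not taken from the paper.

```python
# Hypothetical few-shot prompt construction for LLM-based vulnerability
# repair. The examples pair a vulnerable snippet with its fixed version,
# mirroring the instruction-tuning + few-shot setup the study evaluates.

FEW_SHOT_EXAMPLES = [
    {
        "language": "C",
        "cwe": "CWE-787 (Out-of-bounds Write)",
        "vulnerable": "strcpy(buf, user_input);",
        "fixed": "strncpy(buf, user_input, sizeof(buf) - 1);\n"
                 "buf[sizeof(buf) - 1] = '\\0';",
    },
    {
        "language": "Go",
        "cwe": "CWE-89 (SQL Injection)",
        "vulnerable": 'db.Query("SELECT * FROM users WHERE id = " + id)',
        "fixed": 'db.Query("SELECT * FROM users WHERE id = ?", id)',
    },
]

def build_repair_prompt(examples, target_code, target_language):
    """Assemble an instruction plus few-shot demonstrations, ending with
    the target snippet whose fix the model should generate."""
    parts = ["You are a security engineer. Repair the vulnerability "
             "in the given code and return only the fixed code."]
    for i, ex in enumerate(examples, 1):
        parts.append(
            f"### Example {i} ({ex['language']}, {ex['cwe']})\n"
            f"Vulnerable:\n{ex['vulnerable']}\n"
            f"Fixed:\n{ex['fixed']}"
        )
    # The trailing "Fixed:" cues the model to complete with the repair.
    parts.append(f"### Task ({target_language})\n"
                 f"Vulnerable:\n{target_code}\n"
                 f"Fixed:")
    return "\n\n".join(parts)
```

The resulting string would then be sent to an instruction-tuned model; keeping the demonstrations multilingual (C and Go above) reflects the cross-lingual transfer the study measures.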
Dong Wang
College of Intelligence and Computing, Tianjin University, China
Junji Yu
College of Intelligence and Computing, Tianjin University, China
Honglin Shu
Kyushu University
Michael Fu
The University of Melbourne
Chakkrit Tantithamthavorn
Information Technology, Monash University, Australia
Yasutaka Kamei
Professor, Kyushu University, InaRIS Fellow
Junjie Chen
College of Intelligence and Computing, Tianjin University, China