On the Evaluation of Large Language Models in Multilingual Vulnerability Repair

📅 2025-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing deep learning–based vulnerability repair methods are largely confined to a single language (e.g., C/C++), and the cross-lingual generalization capability of large language models (LLMs) remains insufficiently validated. Method: This work conducts the first large-scale empirical study of multilingual vulnerability repair across seven programming languages, combining instruction tuning, few-shot prompting, and state-of-the-art LLMs, including GPT-4o. Contribution/Results: GPT-4o demonstrates strong cross-lingual generalization, achieving its highest repair success rate on Go, and significantly outperforms prior approaches on high-severity and semantically distinctive vulnerabilities. The study establishes the first benchmarked empirical foundation for multilingual automated vulnerability repair and highlights the critical role of LLMs' semantic understanding in software security.

📝 Abstract
Various deep learning-based approaches built on pre-trained language models have been proposed for automatically repairing software vulnerabilities. However, these approaches are limited to a single programming language (C/C++). Recent advances in large language models (LLMs) offer language-agnostic capabilities and strong semantic understanding, showing potential to overcome this multilingual limitation. Although some work has begun to explore LLMs' repair performance, their effectiveness remains unsatisfactory. To address these limitations, we conducted a large-scale empirical study investigating the performance of automated vulnerability repair approaches and state-of-the-art LLMs across seven programming languages. Results show that GPT-4o, instruction-tuned with few-shot prompting, performs competitively against the leading approach, VulMaster. Additionally, the LLM-based approach shows superior performance in repairing unique vulnerabilities and is more likely to repair the most dangerous ones. Instruction-tuned GPT-4o demonstrates strong generalization on vulnerabilities in previously unseen languages, outperforming existing approaches. Analysis shows that Go consistently achieves the highest effectiveness across all model types, while C/C++ performs the worst. Based on these findings, we discuss the promise of LLMs for multilingual vulnerability repair and the reasons behind their failed cases. This work takes the first look at repair approaches and LLMs across multiple languages, highlighting the promising future of adopting LLMs for multilingual vulnerability repair.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for multilingual vulnerability repair across languages
Assessing GPT-4o's performance against VulMaster in vulnerability fixes
Analyzing LLM effectiveness and failure cases in diverse programming languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Instruction-tuned GPT-4o for multilingual repair
Few-shot prompting enhances vulnerability fixing
LLMs outperform in dangerous vulnerability repair
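The paper does not publish its exact prompt templates, so the following is only an illustrative sketch of how a few-shot vulnerability-repair prompt of the kind described above might be assembled. All example snippets, CWE labels, and the `build_repair_prompt` helper are hypothetical, not taken from the paper.

```python
# Hypothetical few-shot prompt construction for LLM-based vulnerability
# repair. The examples pair a vulnerable snippet with its fixed version,
# mirroring the instruction-tuning + few-shot setup the study evaluates.

FEW_SHOT_EXAMPLES = [
    {
        "language": "C",
        "cwe": "CWE-787 (Out-of-bounds Write)",
        "vulnerable": "strcpy(buf, user_input);",
        "fixed": "strncpy(buf, user_input, sizeof(buf) - 1);\n"
                 "buf[sizeof(buf) - 1] = '\\0';",
    },
    {
        "language": "Go",
        "cwe": "CWE-89 (SQL Injection)",
        "vulnerable": 'db.Query("SELECT * FROM users WHERE id = " + id)',
        "fixed": 'db.Query("SELECT * FROM users WHERE id = ?", id)',
    },
]

def build_repair_prompt(examples, target_code, target_language):
    """Assemble an instruction plus few-shot demonstrations, ending with
    the target snippet whose fix the model should generate."""
    parts = ["You are a security engineer. Repair the vulnerability "
             "in the given code and return only the fixed code."]
    for i, ex in enumerate(examples, 1):
        parts.append(
            f"### Example {i} ({ex['language']}, {ex['cwe']})\n"
            f"Vulnerable:\n{ex['vulnerable']}\n"
            f"Fixed:\n{ex['fixed']}"
        )
    # The trailing "Fixed:" cues the model to complete with the repair.
    parts.append(f"### Task ({target_language})\n"
                 f"Vulnerable:\n{target_code}\n"
                 f"Fixed:")
    return "\n\n".join(parts)
```

The resulting string would then be sent to an instruction-tuned model; keeping the demonstrations multilingual (C and Go above) reflects the cross-lingual transfer the study measures.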
Dong Wang
College of Intelligence and Computing, Tianjin University, China
Junji Yu
College of Intelligence and Computing, Tianjin University, China
Honglin Shu
Kyushu University
Michael Fu
The University of Melbourne
Chakkrit Tantithamthavorn
Information Technology, Monash University, Australia
Yasutaka Kamei
Professor, Kyushu University, InaRIS Fellow
Junjie Chen
College of Intelligence and Computing, Tianjin University, China