Vulnerability Detection with Interprocedural Context in Multiple Languages: Assessing Effectiveness and Cost of Modern LLMs

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses a key limitation of current large language models (LLMs) for vulnerability detection: they typically analyze single functions and struggle to capture interprocedural vulnerabilities arising from cross-function data and control flows. It presents the first systematic evaluation of four leading LLMs—Claude Haiku 4.5, GPT-4.1 Mini, GPT-5 Mini, and Gemini 3 Flash—on detecting interprocedural vulnerabilities in C, C++, and Python using caller-callee context. On 509 real-world vulnerabilities from the ReposVul dataset, experiments show that Gemini 3 Flash achieves an F1 score of at least 0.978 for C at a cost of $0.50–$0.58 per configuration, while Claude Haiku 4.5 correctly identifies vulnerabilities and generates high-quality explanations in 93.6% of cases, confirming the effectiveness and cost-efficiency of leveraging interprocedural context.
📝 Abstract
Large Language Models (LLMs) have emerged as a promising approach to automated vulnerability detection. However, most prior studies have used LLMs to detect vulnerabilities only within single functions, disregarding interprocedural dependencies: they overlook vulnerabilities that arise from data and control flows spanning multiple functions. Leveraging the context provided by callers and callees may therefore help identify such vulnerabilities. This study empirically investigates the detection effectiveness, inference cost, and explanation quality of four modern LLMs (Claude Haiku 4.5, GPT-4.1 Mini, GPT-5 Mini, and Gemini 3 Flash) in detecting vulnerabilities related to interprocedural dependencies. To this end, we conducted an empirical study on 509 vulnerabilities from the ReposVul dataset, systematically varying the level of interprocedural context (target function only, target function + callers, and target function + callees) and evaluating the four LLMs across C, C++, and Python. The results show that Gemini 3 Flash offers the best cost-effectiveness trade-off for C vulnerabilities, achieving F1 >= 0.978 at an estimated cost of $0.50-$0.58 per configuration, while Claude Haiku 4.5 correctly identified and explained the vulnerability in 93.6% of the evaluated cases. Overall, the findings have direct implications for the design of AI-assisted security analysis tools that generalize across codebases in multiple programming languages.
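The kind of interprocedural vulnerability the study targets can be sketched with a toy example (the function names and the SQL-injection scenario below are illustrative assumptions, not drawn from the paper's dataset): the callee looks harmless when analyzed in isolation, and only the caller's context reveals that an untrusted request parameter flows into a dynamically built query.

```python
import sqlite3

def run_query(db, table):
    # Callee: analyzed alone, the f-string interpolation is easy to
    # dismiss -- nothing here says `table` is attacker-controlled.
    return db.execute(f"SELECT secret FROM {table}").fetchall()

def handle_request(db, params):
    # Caller: forwards an untrusted request parameter straight into
    # run_query. Only this cross-function flow exposes the injection.
    return run_query(db, params["table"])

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE notes (secret TEXT, owner TEXT)")
db.executemany("INSERT INTO notes VALUES (?, ?)",
               [("alice-secret", "alice"), ("bob-secret", "bob")])

# A crafted "table name" rewrites the query's semantics: the attacker
# appends a WHERE clause and selectively reads another user's row.
leaked = handle_request(db, {"table": "notes WHERE owner = 'alice'"})
print(leaked)  # [('alice-secret',)]
```

A single-function analysis sees only `run_query`, where the flaw is ambiguous; the paper's target function + callers configuration supplies exactly this missing context.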
Problem

Research questions and friction points this paper is trying to address.

vulnerability detection
interprocedural context
large language models
cross-function dependencies
multi-language security analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

interprocedural context
vulnerability detection
large language models
multi-language evaluation
cost-effectiveness