Vulnerability Detection with Interprocedural Context in Multiple Languages: Assessing Effectiveness and Cost of Modern LLMs

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses a key limitation of current large language models (LLMs) for vulnerability detection: they typically analyze single functions and struggle to capture interprocedural vulnerabilities arising from cross-function data and control flows. It presents the first systematic evaluation of four leading LLMs—Claude Haiku 4.5, GPT-4.1 Mini, GPT-5 Mini, and Gemini 3 Flash—on detecting interprocedural vulnerabilities in C, C++, and Python using caller-callee context. On 509 real-world vulnerabilities from the ReposVul dataset, experiments show that Gemini 3 Flash achieves an F1 score of at least 0.978 for C at a cost of $0.50–$0.58 per configuration, while Claude Haiku 4.5 correctly identifies vulnerabilities and generates high-quality explanations in 93.6% of cases, confirming the effectiveness and cost-efficiency of leveraging interprocedural context.
📝 Abstract
Large Language Models (LLMs) have emerged as a promising approach to automated vulnerability detection. However, most prior studies have used LLMs to detect vulnerabilities only within single functions, disregarding interprocedural dependencies: they overlook vulnerabilities that arise from data and control flows spanning multiple functions. Leveraging the context provided by callers and callees may therefore help identify such vulnerabilities. This study empirically investigates the detection effectiveness, inference cost, and explanation quality of four modern LLMs (Claude Haiku 4.5, GPT-4.1 Mini, GPT-5 Mini, and Gemini 3 Flash) in detecting vulnerabilities related to interprocedural dependencies. To this end, we conducted an empirical study on 509 vulnerabilities from the ReposVul dataset, systematically varying the level of interprocedural context (target function only, target function + callers, and target function + callees) and evaluating the four LLMs across C, C++, and Python. The results show that Gemini 3 Flash offers the best cost-effectiveness trade-off for C vulnerabilities, achieving F1 >= 0.978 at an estimated cost of $0.50-$0.58 per configuration, while Claude Haiku 4.5 correctly identified and explained the vulnerability in 93.6% of the evaluated cases. Overall, the findings have direct implications for the design of AI-assisted security analysis tools that generalize across codebases in multiple programming languages.
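The kind of interprocedural vulnerability the study targets can be sketched with a toy example (the function names and the SQL-injection scenario below are illustrative assumptions, not drawn from the paper's dataset): the callee looks harmless when analyzed in isolation, and only the caller's context reveals that an untrusted request parameter flows into a dynamically built query.

```python
import sqlite3

def run_query(db, table):
    # Callee: analyzed alone, the f-string interpolation is easy to
    # dismiss -- nothing here says `table` is attacker-controlled.
    return db.execute(f"SELECT secret FROM {table}").fetchall()

def handle_request(db, params):
    # Caller: forwards an untrusted request parameter straight into
    # run_query. Only this cross-function flow exposes the injection.
    return run_query(db, params["table"])

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE notes (secret TEXT, owner TEXT)")
db.executemany("INSERT INTO notes VALUES (?, ?)",
               [("alice-secret", "alice"), ("bob-secret", "bob")])

# A crafted "table name" rewrites the query's semantics: the attacker
# appends a WHERE clause and selectively reads another user's row.
leaked = handle_request(db, {"table": "notes WHERE owner = 'alice'"})
print(leaked)  # [('alice-secret',)]
```

A single-function analysis sees only `run_query`, where the flaw is ambiguous; the paper's target function + callers configuration supplies exactly this missing context.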
Problem

Research questions and friction points this paper is trying to address.

vulnerability detection
interprocedural context
large language models
cross-function dependencies
multi-language security analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

interprocedural context
vulnerability detection
large language models
multi-language evaluation
cost-effectiveness