🤖 AI Summary
This study presents an empirical evaluation of open-source large language model (LLM) agents on static application security testing (SAST) tasks. The authors build a general-purpose agent, power it in turn with three open-source models hosted on Ollama, and benchmark each configuration against Bandit, an established SAST tool, on vulnerability detection. The comparison is quantitative: each configuration is scored on precision, recall, and false positive count, and on a composite score derived from those metrics. The results show that current open-source LLM agents perform substantially worse than the specialized SAST tool and are not yet viable replacements for it under realistic conditions. The work provides empirical evidence of where LLMs currently fall short in cybersecurity applications.
📝 Abstract
This paper explores the value of agentic AI tools for cybersecurity purposes. We evaluate the efficacy of a general-purpose agent based on a generative AI (GenAI) large language model (LLM) when powered by three different Ollama-hosted, general-purpose open-source models. We assess each agent's performance using precision, recall, false positive count, and a composite score calculated from these metrics, and compare it against the baseline performance of an existing, vetted Static Application Security Testing (SAST) tool, Bandit. Our findings refute the notion that a modern open-source GenAI LLM-based agent is currently suitable for the specialized task of SAST scanning under realistic conditions.
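The metrics named in the abstract can be illustrated with a minimal sketch. Precision and recall follow their standard definitions. The abstract does not state how the composite score is calculated, so `composite_score` below is purely a hypothetical stand-in (F1 minus a per-false-positive penalty), and `ScanResult` and `fp_penalty` are likewise names invented for this illustration, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ScanResult:
    """Counts from comparing a scanner's findings with known vulnerabilities."""
    true_positives: int   # real vulnerabilities that were flagged
    false_positives: int  # flagged findings that are not real vulnerabilities
    false_negatives: int  # real vulnerabilities that were missed


def precision(r: ScanResult) -> float:
    """Fraction of flagged findings that are real vulnerabilities."""
    flagged = r.true_positives + r.false_positives
    return r.true_positives / flagged if flagged else 0.0


def recall(r: ScanResult) -> float:
    """Fraction of real vulnerabilities that were flagged."""
    actual = r.true_positives + r.false_negatives
    return r.true_positives / actual if actual else 0.0


def composite_score(r: ScanResult, fp_penalty: float = 0.01) -> float:
    """HYPOTHETICAL composite: F1 minus a linear penalty per false positive.

    This is an assumption made for illustration only; the paper's actual
    formula is not given in the abstract.
    """
    p, rc = precision(r), recall(r)
    f1 = 2 * p * rc / (p + rc) if (p + rc) else 0.0
    return max(0.0, f1 - fp_penalty * r.false_positives)


if __name__ == "__main__":
    # Made-up counts, not results from the paper.
    example = ScanResult(true_positives=8, false_positives=2, false_negatives=2)
    print(precision(example), recall(example), composite_score(example))
```

Whatever the actual formula, a composite of this kind lets a scanner that floods the report with false positives score poorly even when its recall is high, which matters for SAST tools whose findings must be triaged by hand.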