Ocassionally Secure: A Comparative Analysis of Code Generation Assistants

📅 2024-02-01
🏛️ arXiv.org
📈 Citations: 9
Influential: 1
🤖 AI Summary
This work addresses the safety and functional correctness of LLM-generated code in realistic software development scenarios. We systematically evaluate GPT-3.5, GPT-4, Bard, and Gemini across nine real-world programming tasks, measuring functional correctness, security, performance, complexity, and reliability—while contrasting security-aware versus security-agnostic developer prompts. Our key contributions include: (1) identifying and characterizing “sporadic security”—a phenomenon wherein minor contextual or stylistic prompt variations significantly impact code security; and (2) introducing the first multidimensional hybrid evaluation framework for practical development, integrating human annotation, static/dynamic analysis (Semgrep/Bandit), functional testing, and maintainability metrics. Results show that only 39% of generated code satisfies both functional correctness and absence of known vulnerabilities; GPT-4 achieves the highest security compliance rate (68%), yet all models exhibit >50% failure rates on cryptography and input validation tasks.
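The summary above mentions automated static analysis with Semgrep and Bandit as part of the hybrid evaluation. As a rough illustration of what such a rule-based pass does, here is a toy scanner over a snippet's AST (the rule IDs echo Bandit's naming scheme, but the rule set and implementation are simplified assumptions for illustration, not Bandit's actual engine):

```python
import ast

# Hypothetical, illustrative subset of risky-call rules, in the spirit of
# the Bandit/Semgrep checks the paper applies to LLM-generated code.
RISKY_CALLS = {
    "eval": "B307: use of eval()",
    "exec": "B102: use of exec()",
    "hashlib.md5": "B324: weak hash (MD5)",
    "yaml.load": "B506: yaml.load without SafeLoader",
}

def qualified_name(node):
    """Best-effort dotted name for a call target (e.g. 'hashlib.md5')."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = qualified_name(node.value)
        return f"{base}.{node.attr}" if base else node.attr
    return None

def scan(source: str):
    """Return (line number, message) findings for risky calls in `source`."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = qualified_name(node.func)
            if name in RISKY_CALLS:
                findings.append((node.lineno, RISKY_CALLS[name]))
    return findings

# Example: a generated snippet that hashes a secret with MD5 gets flagged.
insecure = "import hashlib\ntoken = hashlib.md5(b'secret').hexdigest()\n"
print(scan(insecure))  # → [(2, 'B324: weak hash (MD5)')]
```

A real evaluation would layer many more rules (and dynamic analysis) on top; the point is only that prompt-level differences in the generated code surface as concrete findings at this stage.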

📝 Abstract
Large Language Models (LLMs) are being increasingly utilized in various applications, with code generation being a notable example. While previous research has shown that LLMs have the capability to generate both secure and insecure code, the literature does not take into account what factors help generate secure and effective code. Therefore, in this paper we focus on identifying and understanding the conditions and contexts in which LLMs can be effectively and safely deployed in real-world scenarios to generate quality code. We conducted a comparative analysis of four advanced LLMs--GPT-3.5 and GPT-4 (via ChatGPT) and Bard and Gemini (from Google)--using 9 separate tasks to assess each model's code generation capabilities. We contextualized our study to represent the typical use cases of a real-life developer employing LLMs for everyday tasks at work. Additionally, we place an emphasis on security awareness, which is represented through the use of two distinct versions of our developer persona. In total, we collected 61 code outputs and analyzed them across several aspects: functionality, security, performance, complexity, and reliability. These insights are crucial for understanding the models' capabilities and limitations, guiding future development and practical applications in the field of automated code generation.
Problem

Research questions and friction points this paper is trying to address.

Identifying conditions for secure code generation by LLMs
Comparing code generation capabilities across four advanced LLMs
Analyzing the functionality, security, and reliability of generated code
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comparative analysis of four advanced LLMs
Assessed code generation using nine separate tasks
Analyzed outputs across functionality, security, performance, and complexity
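One of the analyzed dimensions, complexity, is commonly approximated with a McCabe-style cyclomatic complexity count. A minimal sketch of such a metric (illustrative only; the paper does not specify this exact implementation, and branch counting here is a rough approximation):

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """McCabe-style estimate: 1 plus the number of branch points.

    Counts if/for/while/except/conditional-expression nodes, plus boolean
    operators (counted once per BoolOp node, a simplification of McCabe).
    """
    branch_nodes = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                    ast.BoolOp, ast.IfExp)
    tree = ast.parse(source)
    return 1 + sum(isinstance(n, branch_nodes) for n in ast.walk(tree))

# Example: one `if` branch gives a complexity of 2.
src = "def f(x):\n    if x > 0:\n        return x\n    return -x\n"
print(cyclomatic_complexity(src))  # → 2
```

Higher scores indicate more independent paths through the code and thus, all else equal, harder-to-maintain output; comparing such scores across models is one way to operationalize the "complexity" dimension above.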
Ran Elgedawy
University of Tennessee, Knoxville
John Sadik
University of Tennessee, Knoxville
Senjuti Dutta
University of Tennessee, Knoxville
Anuj Gautam
University of Tennessee, Knoxville
Konstantinos Georgiou
PhD Researcher, School of Informatics, Aristotle University of Thessaloniki
Farzin Gholamrezae
Georgia Institute of Technology
Fujiao Ji
University of Tennessee, Knoxville
Kyungchan Lim
University of Tennessee, Knoxville
Qian Liu
University of Tennessee, Knoxville
Scott Ruoti
University of Tennessee, Knoxville