AI Summary
This work addresses the critical security risks of deploying code generated by large language models (LLMs) directly into production systems: such code often contains bugs, vulnerabilities, or malicious content, and traditional manual review is ill-suited to automated pipelines. To this end, we propose STELP, the first security-oriented execution framework specifically designed for LLM-generated code. STELP integrates a secure transpiler, sandboxed execution, and combined static and dynamic analysis to enable fully automated, safe validation and reliable execution without human intervention. We also introduce the first human-annotated dataset of unsafe LLM-generated code and evaluate STELP on public benchmarks. Experimental results demonstrate that STELP significantly outperforms existing approaches in correctness, security, and latency, with particularly strong performance on safely executing high-risk code.
Abstract
The rapid evolution of Large Language Models (LLMs) has brought major advances in reasoning, planning, and function-calling capabilities. Multi-agentic collaborative frameworks built on such LLMs place them at the center of software development tasks such as code generation. However, the direct use of LLM-generated code in production software development systems is problematic. The code may be unstable or erroneous, and it can carry risks such as data poisoning, malicious attacks, and hallucinations that lead to widespread system malfunctions. This prohibits the adoption of LLM-generated code in production AI systems, where human code review and traditional secure testing tools are impractical or untrustworthy. In this paper, we discuss the safety and reliability problems of executing LLM-generated code and propose a Secure Transpiler and Executor of LLM-Generated Program (STELP), capable of executing LLM-generated code in a controlled and safe manner. STELP secures autonomous production AI systems that involve code generation, filling the critical void left by the impracticality or limitations of traditional secure testing methodologies and human oversight. Applications include headless code generation-and-execution pipelines and LLMs that produce executable code snippets as action plans to be executed in real time. We contribute a human-validated dataset of insecure code snippets and benchmark our approach on publicly available datasets for correctness, safety, and latency. Our results demonstrate that our approach outperforms an existing method by a significant margin, particularly in its ability to safely execute risky code snippets. Warning: this paper contains malicious code snippets that should be run with caution.