AI Summary
This work addresses the critical security risks of deploying code generated by large language models (LLMs) directly into production systems: such code often contains bugs, vulnerabilities, or malicious content, and traditional manual review is ill-suited to automated pipelines. To this end, we propose STELP, the first security-oriented execution framework specifically designed for LLM-generated code. STELP integrates a secure transpiler, sandboxed execution, and combined static and dynamic analysis to enable fully automated, safe validation and reliable execution without human intervention. We also introduce the first human-annotated dataset of unsafe LLM-generated code and evaluate STELP on public benchmarks. Experimental results demonstrate that STELP significantly outperforms existing approaches in correctness, security, and latency, with particularly strong performance on safely executing high-risk code.
Abstract
The rapid evolution of Large Language Models (LLMs) has brought major advances in reasoning, planning, and function-calling capabilities. Multi-agentic collaborative frameworks built on such LLMs place them at the center of software development tasks such as code generation. However, the direct use of LLM-generated code in production software development systems is problematic. The code may be unstable or erroneous, and it can carry risks such as data poisoning, malicious attacks, and hallucinations that lead to widespread system malfunctions. This prohibits the adoption of LLM-generated code in production AI systems, where human code review and traditional secure testing tools are impractical or untrustworthy. In this paper, we discuss the safety and reliability problems of executing LLM-generated code and propose a Secure Transpiler and Executor of LLM-Generated Program (STELP), capable of executing LLM-generated code in a controlled and safe manner. STELP secures autonomous production AI systems that involve code generation, filling the critical void left by the impracticality or limitations of traditional secure testing methodologies and human oversight. Applications include headless code generation-and-execution pipelines and LLMs that produce executable code snippets as action plans to be executed in real time. We contribute a human-validated dataset of insecure code snippets and benchmark our approach on publicly available datasets for correctness, safety, and latency. Our results demonstrate that our approach outperforms an existing method by a significant margin, particularly in its ability to safely execute risky code snippets. Warning: this paper contains malicious code snippets that should be run with caution.