SWE-InfraBench: Evaluating Language Models on Cloud Infrastructure Code

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study evaluates how well large language models perform incremental modifications in Infrastructure-as-Code (IaC) development for cloud environments, focusing on enterprise scenarios that require editing existing AWS CDK code from natural language instructions. It introduces the first benchmark tailored to imperative IaC tools, built from real-world code repositories and natural-language-driven incremental editing tasks, with correctness validated through automated testing. Experimental results show that even state-of-the-art models perform poorly: the best model, Sonnet 3.7, achieves only a 34% success rate. This highlights how difficult it is to reason over complex cloud resource dependencies and implementation patterns. The work fills a gap in existing benchmarks, which lack support for incremental editing in imperative IaC contexts.

📝 Abstract

Building infrastructure-as-code (IaC) in cloud computing is a critical task, underpinning the reliability, scalability, and security of modern software systems. Despite the remarkable progress of large language models (LLMs) in software engineering -- demonstrated across many dedicated benchmarks -- their capabilities in developing IaC remain underexplored. Unlike existing IaC benchmarks that predominantly center on declarative paradigms such as Terraform and involve generating entire codebases from scratch, our benchmark reflects the incremental code edits common in enterprise development with imperative tools like the AWS CDK. We present SWE-InfraBench, a diverse evaluation dataset sourced from dozens of real-world IaC codebases that challenges LLMs to perform realistic code modifications in AWS CDK repositories. Each example requires models to implement changes to existing codebases based on natural language instructions, with success determined by passing provided test cases. These tasks demand sophisticated reasoning about cloud resource dependencies and implementation patterns beyond conventional code generation challenges. Our evaluation results reveal significant limitations in current LLMs, showing that even state-of-the-art systems struggle with many tasks -- the best model, Sonnet 3.7, succeeds in only 34% of cases, while specialized reasoning models like DeepSeek R1 achieve just 24% success. The SWE-InfraBench dataset is available at: https://www.kaggle.com/datasets/64e59070fd51c0278560b01eb5dc4f3c447d5268cdabe5a350d2969e4413fea5
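
The evaluation protocol described in the abstract (edit existing code from an instruction, then count a task as solved only if all provided tests pass) can be illustrated with a minimal sketch. This is not the benchmark's actual harness: the `Task` fields, the `edit_fn` callback, and the string-matching "tests" are hypothetical stand-ins chosen so the example runs with the standard library alone. The real tasks use test cases run against AWS CDK repositories.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative, not the dataset schema."""
    instruction: str                      # natural-language change request
    original_code: str                    # existing code the model must edit
    tests: List[Callable[[str], bool]]    # stand-ins for the provided test cases


def passes(task: Task, edited_code: str) -> bool:
    # A task counts as solved only if every provided test passes.
    return all(test(edited_code) for test in task.tests)


def success_rate(tasks: List[Task], edit_fn: Callable[[str, str], str]) -> float:
    # edit_fn stands in for the model: (instruction, original code) -> edited code.
    results = [passes(t, edit_fn(t.instruction, t.original_code)) for t in tasks]
    return sum(results) / len(results) if results else 0.0


# Toy data: the snippets only mimic the shape of CDK calls and are never executed.
tasks = [
    Task("Enable versioning on the bucket",
         "s3.Bucket(self, 'Data')",
         [lambda code: "versioned=True" in code]),
    Task("Set Lambda timeout to 30 seconds",
         "lambda_.Function(self, 'Fn')",
         [lambda code: "Duration.seconds(30)" in code]),
]


def toy_model(instruction: str, code: str) -> str:
    # A deliberately weak "model" that only handles the versioning request.
    if "versioning" in instruction:
        return code.replace(")", ", versioned=True)")
    return code


print(success_rate(tasks, toy_model))  # 0.5
```

Under this all-tests-must-pass scoring, the reported 34% for Sonnet 3.7 would correspond to roughly one task in three being solved outright, with no partial credit for edits that pass only some tests.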

Problem

Research questions and friction points this paper is trying to address.

Infrastructure-as-Code

Large Language Models

AWS CDK

Code Generation

Cloud Infrastructure

Innovation

Methods, ideas, or system contributions that make the work stand out.

Infrastructure-as-Code

AWS CDK

incremental code editing