ACSE-Eval: Can LLMs threat model real-world cloud infrastructure?

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluations of large language models (LLMs) lack rigorous assessment of their capabilities in real-world cloud threat modeling. Method: We introduce ACSE-Eval—the first cloud-security-specific benchmark—comprising 100 production-grade AWS deployment scenarios, each including Infrastructure-as-Code (IaC) templates, architectural descriptions, verified vulnerabilities, and threat parameters. We propose a multidimensional evaluation framework covering threat identification, attack-path analysis, and mitigation recommendation, with novel metrics for accuracy, completeness, and operationalizability. We systematically evaluate leading LLMs under zero-shot and few-shot settings. Results: GPT-4.1 achieves the highest overall performance; Gemini 2.5 Pro excels in zero-shot threat identification; Claude 3.7 Sonnet demonstrates the richest semantic modeling capability. All components—including the dataset, evaluation metrics, and methodology—are fully open-sourced, establishing a new foundation for cloud-native security research and trustworthy LLM evaluation.

📝 Abstract
While Large Language Models have shown promise in cybersecurity applications, their effectiveness in identifying security threats within cloud deployments remains unexplored. This paper introduces AWS Cloud Security Engineering Eval (ACSE-Eval), a novel dataset for evaluating LLMs' cloud security threat modeling capabilities. ACSE-Eval contains 100 production-grade AWS deployment scenarios, each featuring detailed architectural specifications, Infrastructure-as-Code implementations, documented security vulnerabilities, and associated threat modeling parameters. Our dataset enables systematic assessment of LLMs' abilities to identify security risks, analyze attack vectors, and propose mitigation strategies in cloud environments. Our evaluations on ACSE-Eval demonstrate that GPT-4.1 and Gemini 2.5 Pro excel at threat identification, with Gemini 2.5 Pro performing optimally in zero-shot scenarios and GPT-4.1 showing superior results in few-shot settings. While GPT-4.1 maintains a slight overall performance advantage, Claude 3.7 Sonnet generates the most semantically sophisticated threat models but struggles with threat categorization and generalization. To promote reproducibility and advance research in automated cybersecurity threat analysis, we open-source our dataset, evaluation metrics, and methodologies.
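To make the benchmark's shape concrete, the sketch below models one scenario record and a simple F1-style threat-identification score. This is purely illustrative: the field names, the `Scenario` class, and the choice of F1 as the metric are assumptions for exposition, not the authors' released schema or scoring code.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """One ACSE-Eval-style entry (hypothetical schema; the released dataset may differ)."""
    name: str
    iac_template: str           # Infrastructure-as-Code source for the deployment
    architecture: str           # prose architectural description
    vulnerabilities: list       # documented, verified vulnerability labels
    threat_params: dict = field(default_factory=dict)

def threat_identification_score(predicted: set, ground_truth: set) -> float:
    """F1 over predicted vs. documented threats -- one plausible accuracy/completeness metric."""
    if not predicted or not ground_truth:
        return 0.0
    tp = len(predicted & ground_truth)          # true positives
    precision = tp / len(predicted)
    recall = tp / len(ground_truth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: a model flags two of three documented threats plus one spurious finding.
score = threat_identification_score(
    {"public-s3-bucket", "overly-permissive-iam", "open-security-group"},
    {"public-s3-bucket", "overly-permissive-iam", "unencrypted-ebs"},
)
```

A real harness would score each scenario along the paper's three dimensions (threat identification, attack-path analysis, mitigation recommendation) and aggregate per model.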
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to identify cloud security threats
Assessing LLMs' performance in analyzing attack vectors in AWS
Comparing LLM models for threat modeling in cloud environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces AWS Cloud Security Engineering Eval dataset
Evaluates LLMs on cloud security threat modeling
Open-sources dataset and methodologies for reproducibility