🤖 AI Summary
This paper systematically evaluates the capabilities and limitations of large language models (LLMs) in automated exploit generation (AEG). Addressing critical issues in existing benchmarks—such as dataset bias and narrow evaluation dimensions—the authors propose the first dual-dimensional assessment framework jointly measuring *cooperativeness* (e.g., prompt responsiveness, debugging assistance) and *technical capability* (e.g., exploit correctness, reliability). They introduce a de-biased benchmark grounded in five refactored software security labs and design a reproducible, multi-turn LLM-driven attacker prompting paradigm. Experiments span leading closed- and open-weight models—including GPT-4, GPT-4o, and Llama3—employing structured prompting, code semantic reconstruction, and rigorous exploit validation. Results show that while GPT-4 and GPT-4o exhibit high cooperativeness, they fail to produce functionally valid exploits; Llama3 is the most resistant to exploit-generation prompts. Crucially, no model passes strict exploit validity verification, though GPT-4o's low error rate suggests progress; LLM-driven AEG remains at a nascent, foundational stage.
📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in code-related tasks, raising concerns about their potential for automated exploit generation (AEG). This paper presents the first systematic study on LLMs' effectiveness in AEG, evaluating both their cooperativeness and technical proficiency. To mitigate dataset bias, we introduce a benchmark with refactored versions of five software security labs. Additionally, we design an LLM-based attacker to systematically prompt LLMs for exploit generation. Our experiments reveal that GPT-4 and GPT-4o exhibit high cooperativeness, comparable to uncensored models, while Llama3 is the most resistant. However, no model successfully generates exploits for refactored labs, though GPT-4o's minimal errors highlight the potential for LLM-driven AEG advancements.
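The LLM-based attacker described in the abstract works by repeatedly prompting a target model and feeding validation results back into the conversation. A minimal, self-contained sketch of such a multi-turn prompt–validate–feedback loop (illustrative only, not the paper's implementation; `query_model` and `validate_exploit` are hypothetical stubs standing in for a real chat-completion API and the labs' validation harness):

```python
def query_model(history):
    """Stand-in for a call to the target LLM.

    A real attacker would send `history` to a chat-completion API;
    here we return a canned candidate that changes with each turn.
    """
    return "payload_v%d" % len(history)

def validate_exploit(candidate):
    """Stand-in for running the candidate against a refactored lab.

    Real validation would execute the exploit against the vulnerable
    target; here we pretend the third attempt succeeds.
    """
    return candidate.endswith("v3")

def attack_loop(task, max_turns=5):
    """Multi-turn attacker: prompt, validate, append feedback, repeat."""
    history = [task]
    for _ in range(max_turns):
        candidate = query_model(history)
        if validate_exploit(candidate):
            return candidate  # a candidate passed validation
        # Feed the failure back so the next prompt includes it
        history.append("Attempt %s failed validation; revise." % candidate)
    return None  # no valid exploit within the turn budget

result = attack_loop("Generate an exploit for the buffer-overflow lab")
```

Under this design, the model's cooperativeness can be read off from whether it returns candidates at all, while technical capability is measured by whether any candidate ever passes `validate_exploit`, matching the paper's two evaluation axes.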