🤖 AI Summary
This paper addresses the low efficiency and instability of vulnerability discovery by large language models (LLMs) in automated red-teaming, particularly in attack prompt generation. We propose a discoverability-based prompt optimization method that quantifies exploitability by estimating the expected success rate of an individual attack on the target system via multi-random-seed sampling; this metric then guides iterative prompt refinement. Attack Success Rate (ASR) serves as the core evaluation metric, with target-environment randomization and repeated sampling improving assessment robustness. Our key contribution is the first formal modeling of discoverability as a measurable, optimizable prompt quality metric, eliminating reliance on manual annotations or fixed benchmarks. Experiments demonstrate significant improvements in vulnerability identification rate and cross-model generalization, enhancing both the effectiveness and stability of automated red-teaming.
📝 Abstract
Applications that use Large Language Models (LLMs) are becoming widespread, making the identification of system vulnerabilities increasingly important. Automated Red Teaming accelerates this effort by using an LLM to generate and execute attacks against target systems. Attack generators are evaluated using the Attack Success Rate (ASR): the sample mean of the success judgments over all generated attacks. In this paper, we introduce a method for optimizing attack generator prompts that applies ASR to individual attacks. By repeating each attack multiple times against a randomly seeded target, we measure an attack's discoverability: the expected success of that individual attack. This approach reveals exploitable patterns that inform prompt optimization, ultimately enabling more robust evaluation and refinement of generators.
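The two metrics described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `run_attack` is a hypothetical callback standing in for executing one attack against a target seeded with a given random seed and judging its success.

```python
def estimate_discoverability(attack, run_attack, n_trials=10):
    """Estimate an attack's discoverability: the expected probability
    that this single attack succeeds against a randomly seeded target.

    run_attack(attack, seed) is assumed to return True if a judge
    marks the attack as successful for that target seed.
    """
    successes = sum(
        bool(run_attack(attack, seed=seed)) for seed in range(n_trials)
    )
    return successes / n_trials

def attack_success_rate(success_values):
    """ASR: the sample mean over per-attack success values.

    Each element may be a binary judgment (0/1) for a single run,
    or a per-attack discoverability estimate in [0, 1].
    """
    return sum(success_values) / len(success_values)
```

Averaging per-attack discoverability estimates rather than single binary judgments is what makes the ASR estimate robust to randomness in the target environment.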