🤖 AI Summary
This paper addresses the low efficiency and instability of vulnerability discovery by large language models (LLMs) in automated red-teaming, particularly in attack prompt generation. We propose a discoverability-based prompt optimization method that quantifies exploitability by estimating the expected success rate of an individual attack on the target system via multi-random-seed sampling; this metric then guides iterative prompt refinement. Attack Success Rate (ASR) serves as the core evaluation metric, with target-environment randomization and repeated sampling improving assessment robustness. Our key contribution is the first formal modeling of discoverability as a measurable, optimizable prompt quality metric, eliminating reliance on manual annotations or fixed benchmarks. Experiments demonstrate significant improvements in vulnerability identification rate and cross-model generalization, enhancing both the effectiveness and stability of automated red-teaming.
📝 Abstract
Applications that use Large Language Models (LLMs) are becoming widespread, making the identification of system vulnerabilities increasingly important. Automated Red Teaming accelerates this effort by using an LLM to generate and execute attacks against target systems. Attack generators are evaluated using the Attack Success Rate (ASR): the sample mean of the success judgments over all generated attacks. In this paper, we introduce a method for optimizing attack generator prompts that applies ASR to individual attacks. By repeating each attack multiple times against a randomly seeded target, we measure an attack's discoverability: the expected success of that individual attack. This approach reveals exploitable patterns that inform prompt optimization, ultimately enabling more robust evaluation and refinement of generators.
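The two metrics described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `run_attack` is a hypothetical callback standing in for executing one attack against a target seeded with a given random seed and judging its success.

```python
def estimate_discoverability(attack, run_attack, n_trials=10):
    """Estimate an attack's discoverability: the expected probability
    that this single attack succeeds against a randomly seeded target.

    run_attack(attack, seed) is assumed to return True if a judge
    marks the attack as successful for that target seed.
    """
    successes = sum(
        bool(run_attack(attack, seed=seed)) for seed in range(n_trials)
    )
    return successes / n_trials

def attack_success_rate(success_values):
    """ASR: the sample mean over per-attack success values.

    Each element may be a binary judgment (0/1) for a single run,
    or a per-attack discoverability estimate in [0, 1].
    """
    return sum(success_values) / len(success_values)
```

Averaging per-attack discoverability estimates rather than single binary judgments is what makes the ASR estimate robust to randomness in the target environment.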