JailbreakOPT: Tool-Assisted Iterative Jailbreak Prompt Optimization

📅 2026-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing single-turn jailbreak attacks trade expressiveness against adaptability: handcrafted prompts are expressive but static, while iterative methods built on low-level mutations need many target queries. This work proposes JailbreakOPT, a tool-assisted framework that organizes atomic jailbreak prompts into an attack tool library and selects among them with a contextual bandit. A unified intra-episode optimization abstraction composes library elements into standalone attack prompts, and contextual Thompson sampling reuses experience across attack episodes. Across multiple target LLMs and attack goals, the approach is reported to improve attack success rate (ASR) and reduce the number of attacks until success (No.A) relative to atomic single-turn attacks and existing iterative optimization baselines.
📝 Abstract
Jailbreak attacks expose persistent safety weaknesses in large language models (LLMs), but existing stateless single-turn methods face a trade-off: hand-crafted prompts are expressive but static, while iterative prompt optimization can adapt but often relies on low-level mutations that require many target queries. We propose JailbreakOPT, a tool-assisted framework for improving iterative single-turn jailbreak prompt optimization. JailbreakOPT organizes diverse atomic jailbreak prompts into an attack tool library and composes them through a unified intra-episode optimization abstraction to generate stronger standalone attack prompts. To reuse experience across attack episodes, JailbreakOPT further frames tool selection as a contextual bandit problem and applies contextual Thompson sampling to guide exploration and exploitation based on past outcomes. Experiments across multiple target LLMs and attack goals show that JailbreakOPT improves attack success rate (ASR) while reducing the number of attacks until success (No.A) compared with atomic single-turn attacks and existing iterative optimization baselines. This paper may contain offensive or harmful content.
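The contextual Thompson sampling idea named in the abstract can be illustrated with a generic, minimal sketch. This is not the paper's implementation: it assumes discrete context labels, binary rewards, and an independent Beta-Bernoulli posterior per (context, arm) pair, and it runs only against synthetic reward rates. The class name `TabularContextualThompson`, the function `simulate`, and the `TRUE_RATES` table are all hypothetical names introduced for illustration.

```python
import random
from collections import defaultdict


class TabularContextualThompson:
    """Beta-Bernoulli Thompson sampling with one posterior per (context, arm).

    Assumption: contexts are discrete labels and rewards are binary.
    """

    def __init__(self, n_arms, seed=None):
        self.n_arms = n_arms
        self.rng = random.Random(seed)
        # Each arm stores [alpha, beta]; starting at [1, 1] is a uniform prior.
        self.params = defaultdict(lambda: [[1, 1] for _ in range(n_arms)])

    def select(self, context):
        # Draw one plausible success rate per arm and play the best draw.
        draws = [self.rng.betavariate(a, b) for a, b in self.params[context]]
        return max(range(self.n_arms), key=draws.__getitem__)

    def update(self, context, arm, reward):
        # reward is 1 (success) or 0 (failure).
        self.params[context][arm][0 if reward else 1] += 1


def simulate(true_rates, rounds=3000, seed=0):
    """Run the sampler against synthetic Bernoulli rewards; return pull counts."""
    n_arms = len(next(iter(true_rates.values())))
    bandit = TabularContextualThompson(n_arms, seed=seed)
    env = random.Random(seed + 1)
    contexts = sorted(true_rates)
    pulls = {c: [0] * n_arms for c in contexts}
    for _ in range(rounds):
        context = env.choice(contexts)
        arm = bandit.select(context)
        reward = 1 if env.random() < true_rates[context][arm] else 0
        bandit.update(context, arm, reward)
        pulls[context][arm] += 1
    return pulls


# Synthetic success rates: the best arm differs by context.
TRUE_RATES = {"ctx_a": [0.9, 0.1, 0.1], "ctx_b": [0.1, 0.1, 0.9]}
pulls = simulate(TRUE_RATES)
```

In this toy setting the pull counts concentrate on a different arm for each context, which is the sense in which past outcomes guide later selection. A linear or neural posterior would be needed for continuous context features; the tabular form is chosen here only because it fits in the standard library.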
Problem

Research questions and friction points this paper is trying to address.

jailbreak attacks
large language models
prompt optimization
single-turn attacks
safety weaknesses
Innovation

Methods, ideas, or system contributions that make the work stand out.

jailbreak prompt optimization
tool-assisted framework
contextual bandit
Thompson sampling
atomic jailbreak prompts
Ge Shi
Beijing Institute of Technology
Natural Language Processing, Information Extraction
Jun Yin
The Renmin University of China
Donglin Xie
Independent Researcher
Fangyi Liu
Nankai University
Yucan Li
Cornell University
Menglin Liu
The Chinese University of Hong Kong, Shenzhen