Black-box, Adaptive, Efficient, Transferable, Harmful, Applicable... Attacks Are All You Need to Break LLMs

📅 2026-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Jailbreak attacks on large language models (LLMs) lack a standardized evaluation method, which makes robustness estimates unreliable and hinders fair comparison of defenses. To address this gap, this work proposes Indirect Harm Optimization (IHO), which trains a masked diffusion language model as an attacker through iterative preference optimization against a harmfulness judge, using only black-box access to the target. The same attacker can serve as a strong adaptive attack on individual behaviors or as an efficient amortized policy that transfers to held-out behaviors and unseen target models without fine-tuning, and it applies to arbitrary defense pipelines. Even against layered defenses, such as a Circuit Breaker-trained model combined with an auxiliary detector, IHO achieves considerably higher attack success than state-of-the-art methods without defense-specific adaptation. The authors position IHO as a practical step toward standardized LLM jailbreak evaluation.
📝 Abstract
Accurately evaluating adversarial robustness is a longstanding challenge. A flawed attack design can inflate robustness estimates, making deployment risk assessment and defense comparison unreliable. Historically, standardized attacks such as AutoAttack have largely resolved this for image classifiers, providing a reliable evaluation baseline for systematic comparison across defenses. However, no equivalent exists for LLM jailbreak evaluation yet, where designing such an attack is considerably more difficult. A reliable attack must, among other things, be black-box compatible, applicable to arbitrary defense pipelines, and efficient, which no existing method jointly satisfies. We introduce Indirect Harm Optimization (IHO), a masked diffusion language model attacker trained via iterative preference optimization against a harmfulness judge, requiring only black-box access to the target. The same method can be used without modification as a strong adaptive attack on individual behaviors, or as an efficient amortized policy that transfers to held-out behaviors and unseen target models without fine-tuning. Even against layered defenses, such as a Circuit Breaker-trained model combined with an auxiliary detector, IHO improves attack success considerably over state-of-the-art approaches, without any defense-specific adaptation. Our results position IHO as a practical step toward the kind of standardized jailbreak evaluation that has improved reliability in the past. Code and models are available on GitHub and Hugging Face.
Problem

Research questions and friction points this paper is trying to address.

adversarial robustness
LLM jailbreak
standardized attack
black-box evaluation
defense comparison
Innovation

Methods, ideas, or system contributions that make the work stand out.

Indirect Harm Optimization
black-box attack
transferable jailbreak
adversarial robustness evaluation
amortized policy
Authors

Vincent Limbach
Department of Computer Science, Technical University of Munich, Germany

Jonas Dornbusch
Department of Computer Science, Technical University of Munich, Germany; Munich Data Science Institute; Munich Center for Machine Learning

David Lüdke
Technical University Munich
Machine Learning, Generative Modeling

Stephan Günnemann
Professor of Computer Science, Technical University of Munich
Machine Learning, Graphs, Graph Neural Networks, Robustness

Leo Schwinn
Technical University of Munich
Machine Learning, Deep Learning, Adversarial Attacks