Jailbreaking LLMs Without Gradients or Priors: Effective and Transferable Attacks

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing safety evaluation methods for large language models, which often rely on handcrafted priors or gradient information and thus fail to accurately assess robustness in black-box settings. To overcome this, the authors propose RAILS, a framework that conducts efficient adversarial attacks using only model logits, without requiring gradients or prior knowledge. Its key innovations include an autoregressive loss function that enforces exact prefix matching, a history-based token selection strategy that bridges the gap between surrogate objectives and actual attack success rates, and the first-ever cross-tokenizer ensemble attack mechanism, which substantially enhances transferability to closed-source models. Experiments demonstrate that RAILS achieves near-perfect attack success rates on multiple open-source models and exhibits strong black-box transferability against proprietary systems such as GPT and Gemini.

📝 Abstract
As Large Language Models (LLMs) are increasingly deployed in safety-critical domains, rigorously evaluating their robustness against adversarial jailbreaks is essential. However, current safety evaluations often overestimate robustness because existing automated attacks are limited by restrictive assumptions. They typically rely on handcrafted priors or require white-box access for gradient propagation. We challenge these constraints by demonstrating that token-level iterative optimization can succeed without gradients or priors. We introduce RAILS (RAndom Iterative Local Search), a framework that operates solely on model logits. RAILS matches the effectiveness of gradient-based methods through two key innovations: a novel auto-regressive loss that enforces exact prefix matching, and a history-based selection strategy that bridges the gap between the proxy optimization objective and the true attack success rate. Crucially, by eliminating gradient dependency, RAILS enables cross-tokenizer ensemble attacks. This allows for the discovery of shared adversarial patterns that generalize across disjoint vocabularies, significantly enhancing transferability to closed-source systems. Empirically, RAILS achieves near 100% success rates on multiple open-source models and high black-box attack transferability to closed-source systems like GPT and Gemini.
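To make the core idea concrete, below is a minimal, self-contained sketch of what gradient-free token-level iterative local search looks like in general. It is *not* the authors' RAILS implementation: the `surrogate_loss` here is a toy stand-in for the paper's logit-based autoregressive loss (no real model is queried), and names such as `random_local_search`, `VOCAB_SIZE`, and `TARGET` are hypothetical illustration helpers.

```python
import random

VOCAB_SIZE = 50
# Hypothetical optimum, unknown to the search; in a real attack the loss
# would instead be computed from model logits over a target prefix.
TARGET = [7, 3, 3, 9, 1, 4, 4, 2]

def surrogate_loss(suffix):
    """Lower is better. Toy proxy for a logit-derived autoregressive loss."""
    return sum(a != b for a, b in zip(suffix, TARGET))

def random_local_search(length=8, iters=500, candidates=16, seed=0):
    """Iteratively mutate one token position at a time, keeping improvements.

    No gradients and no handcrafted prior: candidate tokens are drawn
    uniformly at random and scored only through the (black-box) loss.
    """
    rng = random.Random(seed)
    best = [rng.randrange(VOCAB_SIZE) for _ in range(length)]
    best_loss = surrogate_loss(best)
    history = [best_loss]                      # track loss over iterations
    for _ in range(iters):
        pos = rng.randrange(length)            # pick a position to mutate
        trial = None
        for _ in range(candidates):            # sample random replacements
            cand = list(best)
            cand[pos] = rng.randrange(VOCAB_SIZE)
            loss = surrogate_loss(cand)
            if trial is None or loss < trial[0]:
                trial = (loss, cand)
        if trial[0] <= best_loss:              # greedy accept; a history-based
            best_loss, best = trial            # selector would re-rank here
        history.append(best_loss)
        if best_loss == 0:
            break
    return best, best_loss, history

if __name__ == "__main__":
    best, loss, hist = random_local_search()
    print(f"final loss {loss} after {len(hist) - 1} iterations")
```

In the paper's setting, the loop above would score candidates by an autoregressive loss over model logits (enforcing exact prefix matching of the target response), and the "history-based selection" would use past candidate evaluations rather than pure greedy acceptance; this sketch only conveys the search skeleton.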
Problem

Research questions and friction points this paper is trying to address.

jailbreaking
large language models
adversarial attacks
transferability
robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

jailbreak attack
gradient-free optimization
transferable adversarial attacks
large language models
black-box attack