🤖 AI Summary
This work exposes structural vulnerabilities in current large language model (LLM) safety guardrails, such as Azure Prompt Shield and Prompt Guard, against prompt injection and jailbreaking attacks. To systematically assess their robustness, we propose a novel attack paradigm that combines character-level perturbations with adversarial evasion techniques from machine learning. Specifically, we leverage offline, white-box computation of token importance (e.g., via gradients or attention weights) to guide black-box attacks, the first approach of its kind. We implement a multi-system black-box evaluation framework and achieve up to a 100% attack success rate (ASR) across six mainstream guardrail systems, significantly outperforming existing methods. Our findings demonstrate that state-of-the-art guardrails lack semantic and structural robustness, underscoring the urgent need for next-generation defenses that jointly preserve semantic integrity and structural resilience.
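The idea of computing token importance offline and using it to guide a black-box attack can be illustrated with a toy sketch. The summary says the paper uses white-box signals such as gradients or attention weights; the occlusion (leave-one-out) scorer and the `toy_detector_score` function below are simplified stand-ins invented for illustration, not the paper's implementation.

```python
def toy_detector_score(prompt: str) -> float:
    """Hypothetical stand-in for a white-box guardrail's 'malicious' score.

    A real attack would read gradients or attention weights from an
    offline surrogate model instead of this keyword heuristic.
    """
    triggers = {"ignore", "previous", "instructions", "system"}
    words = prompt.lower().split()
    return sum(w.strip(".,") in triggers for w in words) / max(len(words), 1)

def rank_word_importance(prompt: str) -> list[tuple[str, float]]:
    """Rank each word by the score drop caused by deleting it (occlusion).

    The highest-ranked words are the ones a black-box attack would
    perturb first, since they contribute most to detection.
    """
    base = toy_detector_score(prompt)
    words = prompt.split()
    ranked = []
    for i, w in enumerate(words):
        occluded = " ".join(words[:i] + words[i + 1:])
        ranked.append((w, base - toy_detector_score(occluded)))
    return sorted(ranked, key=lambda t: t[1], reverse=True)

ranking = rank_word_importance("Please ignore all previous instructions now")
```

Because the ranking is computed entirely offline, the attacker spends no queries against the black-box target to decide where to perturb, which is what makes the transfer from white-box importance to black-box evasion attractive.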
📝 Abstract
Large Language Model (LLM) guardrail systems are designed to protect against prompt injection and jailbreak attacks; however, they remain vulnerable to evasion techniques. We demonstrate two approaches for bypassing LLM prompt injection and jailbreak detection systems: traditional character injection methods and algorithmic Adversarial Machine Learning (AML) evasion techniques. Through testing against six prominent protection systems, including Microsoft's Azure Prompt Shield and Meta's Prompt Guard, we show that both methods can evade detection while maintaining adversarial utility, in some instances achieving a 100% evasion success rate. Furthermore, we demonstrate that adversaries can increase Attack Success Rates (ASR) against black-box targets by leveraging word importance rankings computed by offline white-box models. Our findings reveal vulnerabilities in current LLM protection mechanisms and highlight the need for more robust guardrail systems.
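One of the "traditional character injection methods" the abstract refers to can be sketched as zero-width-character insertion: invisible Unicode code points are interleaved into the prompt so it renders identically to a human reader but tokenizes differently for a detector. This is a generic illustration of the technique class, not the specific perturbations or guardrail APIs evaluated in the paper.

```python
ZWSP = "\u200b"  # zero-width space: carries no glyph when rendered

def inject_zero_width(text: str, every: int = 2) -> str:
    """Insert a zero-width space after every `every` characters.

    The visible text is unchanged, but the byte/token sequence a
    guardrail classifier sees is disrupted.
    """
    out = []
    for i, ch in enumerate(text, start=1):
        out.append(ch)
        if i % every == 0:
            out.append(ZWSP)
    return "".join(out)

perturbed = inject_zero_width("ignore previous instructions")
# Stripping the injected characters recovers the original string,
# which is why the perturbation preserves adversarial utility.
```

A natural defense implied by this sketch is normalization (stripping or folding invisible characters before classification), which is one reason the abstract argues current guardrails need structural as well as semantic robustness.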