DecipherGuard: Understanding and Deciphering Jailbreak Prompts for a Safer Deployment of Intelligent Software Systems

📅 2025-09-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language model (LLM) guardrails are significantly vulnerable to obfuscation- and template-based jailbreak attacks at runtime; for example, LlamaGuard's Defense Success Rate (DSR) drops by 24% under such attacks. To address this, the authors propose DecipherGuard, a multi-layer runtime guard framework that introduces: (1) a lightweight deciphering layer that semantically reconstructs obfuscated prompts via token-level deobfuscation; and (2) low-rank adaptation (LoRA) fine-tuning of the guardrail to improve generalization against unseen template-based attacks. Evaluated on over 22,000 adversarial prompts, DecipherGuard improves DSR by 36% to 65% and Overall Guardrail Performance by 20% to 50% compared to state-of-the-art baselines, substantially strengthening the robustness and security of LLM-powered applications under real-world adversarial conditions.
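The paper does not detail the deciphering layer on this page; the following is a minimal, hypothetical sketch of what a deobfuscation pass in front of a guardrail classifier could look like. The `try_decipher` and `guard` functions, the encodings handled (Base64, ROT13), and the keyword stub standing in for a real guardrail model are all assumptions for illustration, not the authors' implementation:

```python
import base64
import codecs

def try_decipher(prompt: str) -> str:
    """Attempt to reverse common prompt obfuscations (Base64, ROT13).

    Returns the decoded text if a decoding yields printable text,
    otherwise the prompt unchanged. A real deciphering layer would
    handle many more schemes (leetspeak, homoglyphs, word splitting).
    """
    # Try Base64: jailbreak payloads are sometimes wrapped this way.
    try:
        decoded = base64.b64decode(prompt, validate=True).decode("ascii")
        if decoded.isprintable():
            return decoded
    except Exception:
        pass
    # Try ROT13: accept only if the result looks more like English.
    rot = codecs.decode(prompt, "rot13")
    common = (" the ", " and ", " you ")
    if sum(w in rot.lower() for w in common) > sum(w in prompt.lower() for w in common):
        return rot
    return prompt

def guard(prompt: str) -> bool:
    """Return True if the (deciphered) prompt is flagged as unsafe."""
    deciphered = try_decipher(prompt)
    # Stand-in for an actual guardrail model such as LlamaGuard.
    unsafe_markers = ("ignore previous instructions", "build a bomb")
    return any(m in deciphered.lower() for m in unsafe_markers)
```

The point of the design is that the downstream guardrail only ever sees the reconstructed plaintext, so an attack hidden behind an encoding cannot bypass a classifier trained on natural-language inputs.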

📝 Abstract
Intelligent software systems powered by Large Language Models (LLMs) are increasingly deployed in critical sectors, raising concerns about their safety during runtime. Through an industry-academic collaboration when deploying an LLM-powered virtual customer assistant, a critical software engineering challenge emerged: how to enable a safer deployment of LLM-powered software systems at runtime? While LlamaGuard, the current state-of-the-art runtime guardrail, offers protection against unsafe inputs, our study reveals a Defense Success Rate (DSR) drop of 24% under obfuscation- and template-based jailbreak attacks. In this paper, we propose DecipherGuard, a novel framework that integrates a deciphering layer to counter obfuscation-based prompts and a low-rank adaptation mechanism to enhance guardrail effectiveness against template-based attacks. Empirical evaluation on over 22,000 prompts demonstrates that DecipherGuard improves DSR by 36% to 65% and Overall Guardrail Performance (OGP) by 20% to 50% compared to LlamaGuard and two other runtime guardrails. These results highlight the effectiveness of DecipherGuard in defending LLM-powered software systems against jailbreak attacks during runtime.
Problem

Research questions and friction points this paper is trying to address.

Enhancing safety of LLM-powered software systems against jailbreak attacks
Addressing obfuscation- and template-based prompt attacks on runtime guardrails
Improving defense success rates for intelligent systems in critical deployments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates a deciphering layer for obfuscated prompts
Uses low-rank adaptation against template-based attacks
Improves defense success rate and overall guardrail performance
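The low-rank adaptation mentioned above follows the standard LoRA idea: freeze the pretrained weight and learn a small low-rank update. A hypothetical NumPy sketch of the weight update is shown below; the symbols `W`, `A`, `B`, `r`, and `alpha` are the usual LoRA notation, and the dimensions chosen are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r, alpha = 64, 64, 4, 8  # r << d keeps the update cheap

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, init 0

def lora_forward(x: np.ndarray) -> np.ndarray:
    """LoRA forward pass: h = W x + (alpha / r) * B (A x)."""
    return W @ x + (alpha / r) * (B @ (A @ x))
```

Because `B` is initialized to zero, the adapted guardrail initially behaves exactly like the base model, and fine-tuning only updates the `2*r*d` parameters of `A` and `B` rather than the full `d*d` weight, which is what makes adapting a guardrail to new template attacks cheap.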
Rui Yang
Monash University, Australia
Michael Fu
The University of Melbourne
Software Engineering · DevSecOps · Deep Learning · Language Models
Chakkrit Tantithamthavorn
Monash University, Australia
Chetan Arora
Monash University, Australia
Gunel Gulmammadova
Transurban, Australia
Joey Chua
Transurban, Australia