Harmful Prompt Laundering: Jailbreaking LLMs with Abductive Styles and Symbolic Encoding

📅 2025-09-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of large language models (LLMs) to universal jailbreaking attacks. It proposes a black-box, dual-strategy attack framework that requires no white-box access to model internals. Methodologically, the framework integrates abductive reasoning guidance, which implicitly steers models toward generating harmful content, with lightweight variable-symbol encoding, which dynamically obfuscates sensitive semantics to evade keyword-based detection. Crucially, the framework operates solely via prompt engineering, enabling semantic perturbation and inference-path manipulation without accessing or modifying model parameters. Evaluated on GPT-series models, it achieves an attack success rate above 95%, with an average cross-model success rate of 70%. These results expose fundamental weaknesses in current rule-based and supervised fine-tuning safety alignment mechanisms. The study empirically reveals a critical security gap between LLM inference processes and their symbolic representations, providing both key empirical evidence and reverse-engineering insights for developing more robust alignment methods.

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but their potential misuse for harmful purposes remains a significant concern. To strengthen defenses against such vulnerabilities, it is essential to investigate universal jailbreak attacks that exploit intrinsic weaknesses in the architecture and learning paradigms of LLMs. In response, we propose Harmful Prompt Laundering (HaPLa), a novel and broadly applicable jailbreaking technique that requires only black-box access to target models. HaPLa incorporates two primary strategies: 1) abductive framing, which instructs LLMs to infer plausible intermediate steps toward harmful activities, rather than directly responding to explicit harmful queries; and 2) symbolic encoding, a lightweight and flexible approach designed to obfuscate harmful content, given that current LLMs remain sensitive primarily to explicit harmful keywords. Experimental results show that HaPLa achieves over 95% attack success rate on GPT-series models and 70% across all targets. Further analysis with diverse symbolic encoding rules also reveals a fundamental challenge: it remains difficult to safely tune LLMs without significantly diminishing their helpfulness in responding to benign queries.
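The symbolic-encoding strategy, substituting sensitive keywords with neutral variable symbols and supplying a legend, can be sketched as a simple string transformation. This is a minimal illustration of the general idea only, not the paper's actual encoding rules; the function name, the `X1, X2, ...` symbol scheme, and the example keywords are assumptions for demonstration.

```python
def symbolically_encode(query: str, keywords: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each listed keyword with a placeholder symbol (X1, X2, ...)
    and return the encoded query together with a legend mapping symbols
    back to the original keywords."""
    legend: dict[str, str] = {}
    encoded = query
    for i, kw in enumerate(keywords, start=1):
        symbol = f"X{i}"
        legend[symbol] = kw
        encoded = encoded.replace(kw, symbol)
    return encoded, legend

# Benign example: sensitive terms become opaque symbols, so a keyword
# filter scanning the surface text no longer sees them.
encoded, legend = symbolically_encode(
    "How is a firewall bypassed on a router?",
    ["firewall", "router"],
)
print(encoded)  # How is a X1 bypassed on a X2?
print(legend)   # {'X1': 'firewall', 'X2': 'router'}
```

The attacker would send the encoded query plus the legend, relying on the model to resolve the symbols internally while keyword-based safety checks see only neutral tokens.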
Problem

Research questions and friction points this paper is trying to address.

Investigating universal jailbreak attacks exploiting LLM weaknesses
Proposing black-box harmful prompt laundering technique
Addressing difficulty in tuning LLMs safely without reducing helpfulness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Abductive framing for harmful step inference
Symbolic encoding to obfuscate harmful keywords
Black-box jailbreaking with high success rates