Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Although large language models (LLMs) aligned for safety recognize harmful content more reliably, they paradoxically become more vulnerable to a novel jailbreak attack, a phenomenon we term the “safety paradox.” This work proposes Posterior Attack, a single-query method that prompts the model to generate the very response its internal safety classifier would flag as unsafe. We formally characterize this paradox, showing analytically that monotonic improvements in safety alignment amplify posterior vulnerability, and establish a causal link through reinforcement learning interventions. Extensive experiments across 30 open-source LLMs and frontier models, including GPT-5 and Claude 4.6, show that stronger safety judgment goes hand in hand with greater susceptibility to the attack. Notably, deliberately degrading a model’s safety judgment immunizes it against Posterior Attack, whereas enhancing that judgment worsens the vulnerability.

📝 Abstract

Large language models (LLMs) are rigorously aligned to refuse harmful requests, a process that inherently cultivates a latent capacity to evaluate and recognize unsafe content. In this work, we reveal that this advanced safety awareness inadvertently introduces a fatal vulnerability. We introduce Posterior Attack, a single-query jailbreak that bypasses guardrails by prompting the model to generate the exact harmful response its internal classifier would normally flag as unsafe. Through extensive empirical evaluation across 30 open-source LLMs (up to 35B parameters) and frontier models (e.g., GPT-5, Claude 4.6), we observe a striking phenomenon: models with superior safety-judgment capabilities are disproportionately more susceptible to this exploitation. To explain this, we formalize the Safety Paradox, analytically showing that monotonic improvements in safety alignment naturally amplify posterior vulnerability. Finally, we establish a causal link via reinforcement learning interventions, showing that artificially degrading a model's safety judgment immunizes it against the attack, whereas enhancing judgment exacerbates the vulnerability. Our findings highlight potential flaws in current alignment paradigms, indicating that defense mechanisms may require further structural refinement.
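
The central empirical claim, that stronger safety judgment tracks higher attack susceptibility, is a statement about how two per-model scores co-vary. One common way to quantify such a relationship is a rank correlation; the abstract does not say which statistic the authors use, so the sketch below is only an illustration of the idea. The function names and all numbers are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical placeholder scores for five models (NOT from the paper):
# safety-judgment accuracy and attack success rate, one entry per model.
judgment_accuracy = [0.55, 0.62, 0.71, 0.80, 0.93]
attack_success_rate = [0.10, 0.08, 0.35, 0.50, 0.90]

rho = spearman(judgment_accuracy, attack_success_rate)
print(f"{rho:.2f}")  # prints 0.90 for these placeholder numbers
```

A coefficient near +1 would correspond to the pattern the abstract describes, where models that judge safety better are also attacked more successfully. A value near zero or below would not.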

Problem

Research questions and friction points this paper is trying to address.

Safety Paradox

Posterior Attack

LLM safety

alignment vulnerability

jailbreak

Innovation

Methods, ideas, or system contributions that make the work stand out.

Posterior Attack

Safety Paradox

LLM Alignment