Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs

📅 2025-02-26

📈 Citations: 0

✨ Influential: 0

career value

179K/year

🤖 AI Summary

Existing LLM jailbreak defense methods overly rely on superficial textual patterns, resulting in poor generalizability. To address this, we propose the Essence-Driven Defense Framework (EDDF), the first approach to explicitly model and leverage the *semantic essence*—rather than surface-level prompts—of jailbreak attacks. EDDF employs contrastive learning to construct a curated attack-essence vector repository and introduces a two-stage, plug-and-play filtering mechanism: offline repository construction followed by online retrieval-based input purification. Evaluated across diverse jailbreak attacks, EDDF achieves an average reduction of over 20% in attack success rate while maintaining a low false-positive rate, significantly outperforming state-of-the-art defenses. Notably, it demonstrates strong robustness against unseen attack variants, underscoring its generalizability and practical utility.

Technology Category

Application Category

📝 Abstract

Although Aligned Large Language Models (LLMs) are trained to refuse harmful requests, they remain vulnerable to jailbreak attacks. Unfortunately, existing methods often focus on surface-level patterns, overlooking the deeper attack essences. As a result, defenses fail when attack prompts change, even though the underlying"attack essence"remains the same. To address this issue, we introduce EDDF, an extbf{E}ssence- extbf{D}riven extbf{D}efense extbf{F}ramework Against Jailbreak Attacks in LLMs. EDDF is a plug-and-play input-filtering method and operates in two stages: 1) offline essence database construction, and 2) online adversarial query detection. The key idea behind EDDF is to extract the"attack essence"from a diverse set of known attack instances and store it in an offline vector database. Experimental results demonstrate that EDDF significantly outperforms existing methods by reducing the Attack Success Rate by at least 20%, underscoring its superior robustness against jailbreak attacks.

Problem

Research questions and friction points this paper is trying to address.

Defend against jailbreak attacks in LLMs

Extract deeper attack essences

Reduce Attack Success Rate significantly

Innovation

Methods, ideas, or system contributions that make the work stand out.

Essence-Driven Defense Framework

Offline essence database construction

Online adversarial query detection

🔎 Similar Papers

Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks