Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs

๐Ÿ“… 2025-02-26
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing LLM jailbreak defense methods overly rely on superficial textual patterns, resulting in poor generalizability. To address this, we propose the Essence-Driven Defense Framework (EDDF), the first approach to explicitly model and leverage the *semantic essence*โ€”rather than surface-level promptsโ€”of jailbreak attacks. EDDF employs contrastive learning to construct a curated attack-essence vector repository and introduces a two-stage, plug-and-play filtering mechanism: offline repository construction followed by online retrieval-based input purification. Evaluated across diverse jailbreak attacks, EDDF achieves an average reduction of over 20% in attack success rate while maintaining a low false-positive rate, significantly outperforming state-of-the-art defenses. Notably, it demonstrates strong robustness against unseen attack variants, underscoring its generalizability and practical utility.

Technology Category

Application Category

๐Ÿ“ Abstract
Although Aligned Large Language Models (LLMs) are trained to refuse harmful requests, they remain vulnerable to jailbreak attacks. Unfortunately, existing methods often focus on surface-level patterns, overlooking the deeper attack essences. As a result, defenses fail when attack prompts change, even though the underlying"attack essence"remains the same. To address this issue, we introduce EDDF, an extbf{E}ssence- extbf{D}riven extbf{D}efense extbf{F}ramework Against Jailbreak Attacks in LLMs. EDDF is a plug-and-play input-filtering method and operates in two stages: 1) offline essence database construction, and 2) online adversarial query detection. The key idea behind EDDF is to extract the"attack essence"from a diverse set of known attack instances and store it in an offline vector database. Experimental results demonstrate that EDDF significantly outperforms existing methods by reducing the Attack Success Rate by at least 20%, underscoring its superior robustness against jailbreak attacks.
Problem

Research questions and friction points this paper is trying to address.

Defend against jailbreak attacks in LLMs
Extract deeper attack essences
Reduce Attack Success Rate significantly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Essence-Driven Defense Framework
Offline essence database construction
Online adversarial query detection
๐Ÿ”Ž Similar Papers
No similar papers found.
S
Shiyu Xiang
Sichuan University
A
Ansen Zhang
Shandong University
Y
Yanfei Cao
University of Science and Technology of China
Yang Fan
Yang Fan
University of Science and Technology of China
Learning to TeachAutomated Machine LearningNeural Architecture SearchNatural Language ProcessingAI for Medicine
R
Ronghao Chen
Peking University