AI Summary
Existing LLM jailbreak defenses rely too heavily on superficial textual patterns and therefore generalize poorly. To address this, we propose the Essence-Driven Defense Framework (EDDF), the first approach to explicitly model and leverage the *semantic essence* of jailbreak attacks rather than their surface-level prompts. EDDF employs contrastive learning to construct a curated repository of attack-essence vectors and introduces a two-stage, plug-and-play filtering mechanism: offline repository construction followed by online retrieval-based input purification. Evaluated across diverse jailbreak attacks, EDDF achieves an average reduction of over 20% in attack success rate while maintaining a low false-positive rate, significantly outperforming state-of-the-art defenses. Notably, it demonstrates strong robustness against unseen attack variants, underscoring its generalizability and practical utility.
Abstract
Although aligned large language models (LLMs) are trained to refuse harmful requests, they remain vulnerable to jailbreak attacks. Unfortunately, existing defense methods often focus on surface-level patterns and overlook the deeper attack essence. As a result, defenses fail when attack prompts change, even though the underlying "attack essence" remains the same. To address this issue, we introduce EDDF, an **E**ssence-**D**riven **D**efense **F**ramework Against Jailbreak Attacks in LLMs. EDDF is a plug-and-play input-filtering method that operates in two stages: 1) offline essence database construction, and 2) online adversarial query detection. The key idea behind EDDF is to extract the "attack essence" from a diverse set of known attack instances and store it in an offline vector database. Experimental results demonstrate that EDDF significantly outperforms existing methods, reducing the Attack Success Rate by at least 20% and underscoring its superior robustness against jailbreak attacks.