ShieldLearner: A New Paradigm for Jailbreak Attack Defense in LLMs

📅 2025-02-16

📈 Citations: 0

✨ Influential: 0

career value

181K/year

🤖 AI Summary

Large language models (LLMs) remain vulnerable to adversarial jailbreaking attacks, while existing prompt-based defense methods suffer from limited adaptability, interpretability, and customization. To address these challenges, this paper proposes a human-like autonomous learning defense paradigm. Our approach introduces three key contributions: (1) a novel dual-module framework—Pattern Atlas and Meta-analysis Framework—that jointly enables attack pattern modeling and meta-rule synthesis; (2) Adaptive Adversarial Augmentation, a training-free mechanism supporting continuous self-evolution against emerging threats; and (3) a rigorously constructed hard test suite derived from Wildjailbreak, designed to evaluate robustness against stealthy, adaptive attacks. Experiments demonstrate that our method achieves statistically significant improvements in defense success rate over state-of-the-art baselines on both standard and hard benchmarks, reduces computational overhead by over 40%, and supports zero-shot fine-tuning and lightweight online deployment.

Technology Category

Application Category

📝 Abstract

Large Language Models (LLMs) have achieved remarkable success in various domains but remain vulnerable to adversarial jailbreak attacks. Existing prompt-defense strategies, including parameter-modifying and parameter-free approaches, face limitations in adaptability, interpretability, and customization, constraining their effectiveness against evolving threats. To address these challenges, we propose ShieldLearner, a novel paradigm that mimics human learning in defense. Through trial and error, it autonomously distills attack signatures into a Pattern Atlas and synthesizes defense heuristics into a Meta-analysis Framework, enabling systematic and interpretable threat detection. Furthermore, we introduce Adaptive Adversarial Augmentation to generate adversarial variations of successfully defended prompts, enabling continuous self-improvement without model retraining. In addition to standard benchmarks, we create a hard test set by curating adversarial prompts from the Wildjailbreak dataset, emphasizing more concealed malicious intent. Experimental results show that ShieldLearner achieves a significantly higher defense success rate than existing baselines on both conventional and hard test sets, while also operating with lower computational overhead, making it a practical and efficient solution for real-world adversarial defense.

Problem

Research questions and friction points this paper is trying to address.

Defends LLMs against jailbreak attacks

Improves adaptability and interpretability of defenses

Reduces computational overhead in threat detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mimics human learning in defense

Autonomously distills attack signatures

Adaptive Adversarial Augmentation introduced

🔎 Similar Papers

SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner